Abstract

We present surrogate regret bounds for arbitrary surrogate losses in
the context of binary classification with label-dependent costs. Such
bounds relate a classifier's risk, assessed with respect to a surrogate
loss, to its cost-sensitive classification risk. Two approaches to
surrogate regret bounds are developed. The first is a direct
generalization of Bartlett et al. [2006], who focus on margin-based
losses and cost-insensitive classification, while the second adopts the
framework of Steinwart [2007] based on calibration functions. Nontrivial
surrogate regret bounds are shown to exist precisely when the surrogate
loss satisfies a "calibration" condition that is easily verified for many
common losses. We apply this theory to the class of uneven margin losses,
and characterize when these losses are properly calibrated. The uneven
hinge, squared error, exponential, and sigmoid losses are then treated
in detail.



1 Introduction 



Binary classification is concerned with the prediction of a label
$Y \in \{-1, 1\}$ from a feature vector $X$ by means of a classifier. A
classifier can be represented as a mapping $x \mapsto \mathrm{sign}(f(x))$
where $f$ is a real-valued decision function. The goal of classification
is to learn $f$ from a training sample $(X_1, Y_1), \ldots, (X_n, Y_n)$.
When the cost of misclassifying $X$ is not dependent on $Y$, the
performance of $f$ is typically measured by the risk
$R(f) = E_{X,Y}[1_{\{Y \ne \mathrm{sign}(f(X))\}}]$. Since minimization of
the empirical risk is usually intractable, it is common in practice to
instead minimize the empirical version of the $L$-risk
$R_L(f) = E_{X,Y}[L(Y, f(X))]$, where $L(y,t)$ is a surrogate loss, chosen
for its computational qualities such as convexity.

Bartlett, Jordan, and McAuliffe [2006] study conditions under which
consistency with respect to an $L$-risk implies consistency with respect
to the original risk $R(f)$. To be more specific, let $R^*$ and $R_L^*$
denote the minimal risk and $L$-risk, respectively, over all possible
decision functions. Bartlett et al. examine when there exists an
invertible function $\theta$ with $\theta(0) = 0$ such that

$$R(f) - R^* \le \theta(R_L(f) - R_L^*) \qquad (1)$$

for all $f$ and all distributions on $(X, Y)$. We refer to such a
relationship as a surrogate regret bound, since $R(f) - R^*$ and
$R_L(f) - R_L^*$ are known as the regret and surrogate regret,
respectively.

Bartlett et al. study margin losses, which have the form
$L(y, t) = \phi(yt)$ for some $\phi : \mathbb{R} \to [0, \infty)$. They show
that non-trivial surrogate regret bounds exist precisely when $L$ is
classification-calibrated, which is a technical condition they develop.

In this paper we extend the work of Bartlett et al. in two ways. First, we
consider risks that account for label-dependent misclassification costs.
Second, we study arbitrary surrogate losses, not just margin losses. We
show that non-trivial surrogate regret bounds exist when $L$ is
$\alpha$-classification calibrated, where $\alpha \in (0, 1)$ represents
the misclassification cost asymmetry. This condition is a natural
generalization of classification calibrated. We also give results that
facilitate the calculation of these bounds, and the characterization of
which losses are $\alpha$-classification calibrated.



Steinwart [2007] extends the work of Bartlett et al. in a very general
way that encompasses several supervised and unsupervised learning
problems. He applies this framework to cost-sensitive classification, but
restricts his attention to margin-based losses. We apply this framework to
derive surrogate regret bounds for cost-sensitive classification and
arbitrary losses. The results obtained in this manner are shown to be
equivalent to the bounds obtained by generalizing the approach of
Bartlett et al.

Reid and Williamson [2009a,b] also study $\alpha$-classification
calibrated losses and derive surrogate regret bounds for cost-sensitive
classification. Their focus is on class probability estimation, and unlike
the present work, they impose certain conditions on the surrogate loss,
such as differentiability everywhere. Therefore they do not address
important losses such as the hinge loss. In addition, their bounds are not
in the form of (1), but rather are stated implicitly. We also note that
their examples of surrogate regret bounds [Reid and Williamson, 2009a]
consider only margin losses.

Additional comparisons to the above cited and other works are given
later. Because we allow for asymmetry in both the misclassification costs
and surrogate loss, unlike the original analysis of Bartlett et al.
[2006], certain aspects of our analysis are necessarily different.

A motivation for this work is to understand uneven margin losses, which
have the form

$$L(y,t) = 1_{\{y=1\}} \phi(t) + 1_{\{y=-1\}} \beta \phi(-\gamma t)$$

for some $\phi : \mathbb{R} \to [0, \infty)$ and $\beta, \gamma > 0$.
Various instances of such losses have appeared in the literature (see
Sec. 4 for specific references), primarily as a heuristic modification of
margin losses to account for cost asymmetry or unbalanced datasets. They
are computationally attractive because they can typically be optimized by
modifications of margin-based algorithms. However, statistical aspects of
these losses have not been studied. We characterize when they are
$\alpha$-classification calibrated and compute explicit surrogate regret
bounds for four specific examples of $\phi$.

When applied to uneven margin losses, our work has practical implica- 
tions for adapting well-known algorithms, such as Adaboost and support 
vector machines, to settings with unbalanced data or label-dependent costs. 
These are discussed in the concluding section. 

The rest of the paper is organized as follows. Section 2 develops a
general framework for surrogate regret bounds that handles label-dependent
costs and arbitrary surrogate losses. The special case of cost-insensitive
classification with general losses is considered, and a refined treatment
is also given for the case of convex losses. Section 3 relates our problem
to the general framework of Steinwart [2007], and provides an alternate,
yet ultimately equivalent approach to surrogate regret bounds using
so-called calibration functions. Section 4 examines uneven margin losses
in detail, including four specific instances of $\phi$ corresponding to
the hinge, squared error, exponential, and sigmoid functions. A concluding
discussion is offered in Section 5. Supporting lemmas and additional
details may be found in two appendices.



2 Surrogate Losses and Regret Bounds 

Let $(X, Y)$ have distribution $P$ on $\mathcal{X} \times \{-1, 1\}$. Let
$\mathcal{F}$ denote the set of all measurable functions
$f : \mathcal{X} \to \mathbb{R}$. Every $f \in \mathcal{F}$ defines a
classifier by the rule $x \mapsto \mathrm{sign}(f(x))$. We adopt the
convention $\mathrm{sign}(0) = -1$.

A loss is a measurable function
$L : \{-1, 1\} \times \mathbb{R} \to [0, \infty)$. Any loss can be written

$$L(y,t) = 1_{\{y=1\}} L_1(t) + 1_{\{y=-1\}} L_{-1}(t).$$

We refer to $L_1$ and $L_{-1}$ as the partial losses of $L$. The $L$-risk
of $f$ is $R_L(f) := E_{X,Y}[L(Y, f(X))]$. The optimal $L$-risk is
$R_L^* := \inf_{f \in \mathcal{F}} R_L(f)$. The cost-sensitive
classification loss with cost parameter $\alpha \in (0, 1)$ is

$$U_\alpha(y,t) := (1 - \alpha) 1_{\{y=1\}} 1_{\{t \le 0\}} + \alpha 1_{\{y=-1\}} 1_{\{t > 0\}}.$$

When $L = U_\alpha$, we write $R_\alpha(f)$ and $R_\alpha^*$ instead of
$R_{U_\alpha}(f)$ and $R_{U_\alpha}^*$. Although other parametrizations of
cost-sensitive classification losses are possible, this one is convenient
because an optimal classifier is $\mathrm{sign}(\eta(x) - \alpha)$ where
$\eta(x) := P(Y = 1 \mid X = x)$. See Lemma 1, part 1. We are motivated by
applications where it is desirable to minimize the $U_\alpha$-risk, but
the empirical $U_\alpha$-risk cannot be optimized efficiently. In such
situations it is common to minimize the (empirical) $L$-risk for some
surrogate loss $L$ that has a computationally desirable property such as
differentiability or convexity.

Define the conditional $L$-risk

$$C_L(\eta, t) := \eta L_1(t) + (1 - \eta) L_{-1}(t)$$

for $\eta \in [0,1]$, $t \in \mathbb{R}$, and the optimal conditional
$L$-risk $C_L^*(\eta) = \inf_{t \in \mathbb{R}} C_L(\eta, t)$ for
$\eta \in [0,1]$. These are so-named because
$R_L(f) = E_X[C_L(\eta(X), f(X))]$ and $R_L^* = E_X[C_L^*(\eta(X))]$. Note
that we use $\eta$ to denote both the function
$\eta(x) = P(Y = 1 \mid X = x)$ and a scalar $\eta \in [0, 1]$. The
meaning should be clear from context. When $L = U_\alpha$, we write
$C_\alpha(\eta, t)$ and $C_\alpha^*(\eta)$ for $C_{U_\alpha}(\eta, t)$ and
$C_{U_\alpha}^*(\eta)$. Measurability issues with these and other
quantities are addressed in Steinwart [2007].
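
As a small numerical illustration of these definitions, the following Python sketch evaluates the conditional risk of the cost-sensitive loss and its optimum, and checks the identity from Lemma 1, part 1, namely that the conditional regret equals $|\eta - \alpha|$ when $\mathrm{sign}(t) \ne \mathrm{sign}(\eta - \alpha)$ and zero otherwise. The function names are hypothetical and chosen only for this sketch; they are not part of the paper.

```python
def sign(t):
    """Sign with the paper's convention sign(0) = -1."""
    return 1 if t > 0 else -1


def cond_risk_cs(alpha, eta, t):
    """Conditional risk C_alpha(eta, t) of the cost-sensitive loss U_alpha."""
    loss_pos = (1 - alpha) if t <= 0 else 0.0  # U_alpha(+1, t)
    loss_neg = alpha if t > 0 else 0.0         # U_alpha(-1, t)
    return eta * loss_pos + (1 - eta) * loss_neg


def opt_cond_risk_cs(alpha, eta):
    """Optimal conditional risk: the cheaper of the two possible predictions."""
    return min((1 - alpha) * eta, alpha * (1 - eta))


def cond_regret_cs(alpha, eta, t):
    """Conditional regret C_alpha(eta, t) - C_alpha^*(eta)."""
    return cond_risk_cs(alpha, eta, t) - opt_cond_risk_cs(alpha, eta)


if __name__ == "__main__":
    alpha = 0.3
    for eta in (0.1, 0.3, 0.6, 0.9):
        for t in (-1.0, 0.0, 2.0):
            wrong = sign(t) != sign(eta - alpha)
            expected = abs(eta - alpha) if wrong else 0.0
            assert abs(cond_regret_cs(alpha, eta, t) - expected) < 1e-12
```

The check also shows why the regret never exceeds $\max(\alpha, 1 - \alpha)$: the quantity $|\eta - \alpha|$ is largest at $\eta = 0$ or $\eta = 1$.
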



This section has three parts. In 2.1 we extend the work of Bartlett et al.
[2006], on surrogate regret bounds for margin losses and cost-insensitive
classification, to general losses and cost-sensitive classification. In
2.2 we specialize our results to the important special case of
cost-insensitive classification with general losses, and in 2.3 we present
some results for the case of convex partial losses.

2.1 $\alpha$-classification calibration and surrogate regret bounds

For $\alpha \in (0, 1)$ and any loss $L$, define

$$H_{L,\alpha}(\eta) := C_{L,\alpha}^-(\eta) - C_L^*(\eta)$$

for $\eta \in [0, 1]$, where

$$C_{L,\alpha}^-(\eta) := \inf_{t \in \mathbb{R} : t(\eta - \alpha) \le 0} C_L(\eta, t).$$

Note that $H_{L,\alpha}(\eta) \ge 0$ for all $\eta \in [0, 1]$.

Definition 1. We say $L$ is $\alpha$-classification calibrated, and write
$L$ is $\alpha$-CC, if $H_{L,\alpha}(\eta) > 0$ for all $\eta \in [0,1]$,
$\eta \ne \alpha$.

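Intuitively, $L$ is $\alpha$-CC if, for all $x$ such that
$\eta(x) \ne \alpha$, the value of $t = f(x)$ minimizing the conditional
$L$-risk has the same sign as the optimal predictor $\eta(x) - \alpha$.
Denote $B_\alpha := \max(\alpha, 1 - \alpha)$. Note that the regret,
$R_\alpha(f) - R_\alpha^*$, and the conditional regret,
$C_\alpha(\eta, t) - C_\alpha^*(\eta)$, both take value in
$[0, B_\alpha]$. This can be seen from Lemma 1, part 1. Next, define

$$\nu_{L,\alpha}(\epsilon) = \min_{\eta \in [0,1] : |\eta - \alpha| = \epsilon} H_{L,\alpha}(\eta)$$

for $\epsilon \in [0, B_\alpha]$. Notice that for $\alpha \le \frac{1}{2}$,

$$\nu_{L,\alpha}(\epsilon) = \begin{cases} \min(H_{L,\alpha}(\alpha + \epsilon), H_{L,\alpha}(\alpha - \epsilon)), & 0 \le \epsilon \le \alpha \\ H_{L,\alpha}(\alpha + \epsilon), & \alpha < \epsilon \le 1 - \alpha \end{cases} \qquad (2)$$

and for $\alpha > \frac{1}{2}$,

$$\nu_{L,\alpha}(\epsilon) = \begin{cases} \min(H_{L,\alpha}(\alpha + \epsilon), H_{L,\alpha}(\alpha - \epsilon)), & 0 \le \epsilon \le 1 - \alpha \\ H_{L,\alpha}(\alpha - \epsilon), & 1 - \alpha < \epsilon \le \alpha. \end{cases} \qquad (3)$$

Finally, define $\psi_{L,\alpha}(\epsilon) = \nu_{L,\alpha}^{**}(\epsilon)$
for $\epsilon \in [0, B_\alpha]$, where $g^{**}$ denotes the
Fenchel-Legendre biconjugate of $g$. The biconjugate of $g$ is the largest
convex lower semi-continuous function that is $\le g$, and is defined by

$$\mathrm{Epi}\, g^{**} = \overline{\mathrm{co}\, \mathrm{Epi}\, g},$$

where $\mathrm{Epi}\, g = \{(r, s) : g(r) \le s\}$ is the epigraph of $g$,
co denotes the convex hull, and the bar indicates set closure. Since
$\nu_{L,\alpha}(0) = 0$ (Lemma 1, part 4), $\nu_{L,\alpha}$ is
nonnegative, and $\psi_{L,\alpha}$ is convex, we know
$\psi_{L,\alpha}(0) = 0$ and $\psi_{L,\alpha}$ is nondecreasing.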
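
To make the biconjugate concrete, the sketch below computes the largest convex function lying below a finite set of sampled points, which is a grid approximation of $g^{**}$ on a compact interval. It uses the lower half of a standard monotone-chain convex hull; the function names are hypothetical, and the closure and lower semi-continuity subtleties of the exact definition are ignored.

```python
def lower_convex_envelope(xs, ys):
    """Vertices of the largest convex function below the points (xs[i], ys[i]).

    xs must be strictly increasing. This is the lower half of Andrew's
    monotone-chain convex hull.
    """
    hull = []
    for p in zip(xs, ys):
        # Drop the last vertex while it lies on or above the chord from
        # hull[-2] to p (cross product <= 0 means no strict left turn).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def evaluate_envelope(hull, x):
    """Linearly interpolate the envelope at x (x within the hull's range)."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    raise ValueError("x outside the range of the envelope")


if __name__ == "__main__":
    # A non-convex sample: the point (1, 2) lies above the chord and is dropped.
    hull = lower_convex_envelope([0, 1, 2, 3], [0, 2, 1, 3])
    print(hull, evaluate_envelope(hull, 1))
```

Applied to samples of $\nu_{L,\alpha}$ on a fine grid of $[0, B_\alpha]$, the returned vertices give a piecewise linear approximation of $\psi_{L,\alpha}$.
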

Theorem 1. Let $L$ be a loss and $\alpha \in (0, 1)$.

1. For all $f \in \mathcal{F}$ and all distributions $P$,

$$\psi_{L,\alpha}(R_\alpha(f) - R_\alpha^*) \le R_L(f) - R_L^*.$$

2. $\psi_{L,\alpha}$ is invertible if and only if $L$ is $\alpha$-CC.






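Proof. For the first part, by Lemma 1, part 1 we know

$$R_\alpha(f) - R_\alpha^* = E_X[1_{\{\mathrm{sign} f(X) \ne \mathrm{sign}(\eta(X) - \alpha)\}} |\eta(X) - \alpha|] \le E_X[1_{\{f(X)(\eta(X) - \alpha) \le 0\}} |\eta(X) - \alpha|].$$

Then

$$\begin{aligned}
\psi_{L,\alpha}(R_\alpha(f) - R_\alpha^*)
&\le E_X[\psi_{L,\alpha}(1_{\{f(X)(\eta(X) - \alpha) \le 0\}} |\eta(X) - \alpha|)] \\
&\le E_X[\nu_{L,\alpha}(1_{\{f(X)(\eta(X) - \alpha) \le 0\}} |\eta(X) - \alpha|)] \\
&= E_X[1_{\{f(X)(\eta(X) - \alpha) \le 0\}} \nu_{L,\alpha}(|\eta(X) - \alpha|)] \\
&= E_X\Big[1_{\{f(X)(\eta(X) - \alpha) \le 0\}} \min_{\eta' \in [0,1] : |\eta' - \alpha| = |\eta(X) - \alpha|} H_{L,\alpha}(\eta')\Big] \\
&\le E_X[1_{\{f(X)(\eta(X) - \alpha) \le 0\}} H_{L,\alpha}(\eta(X))] \\
&= E_X\Big[1_{\{f(X)(\eta(X) - \alpha) \le 0\}} \Big(\inf_{t : t(\eta(X) - \alpha) \le 0} C_L(\eta(X), t) - C_L^*(\eta(X))\Big)\Big] \\
&\le E_X[C_L(\eta(X), f(X)) - C_L^*(\eta(X))] \\
&= R_L(f) - R_L^*.
\end{aligned}$$

The first inequality is Jensen's (together with the fact that
$\psi_{L,\alpha}$ is nondecreasing), and the first equality follows from
$\nu_{L,\alpha}(0) = 0$.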

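Now consider the second part. If $\psi_{L,\alpha}$ is invertible, then
$\psi_{L,\alpha}(\epsilon) > 0$ for all $\epsilon \in (0, B_\alpha]$,
because $\psi_{L,\alpha}(0) = 0$ and $\psi_{L,\alpha}$ is nonnegative.
Since $\psi_{L,\alpha} \le \nu_{L,\alpha}$, we know
$\nu_{L,\alpha}(\epsilon) > 0$ for all $\epsilon \in (0, B_\alpha]$, which
by definition of $\nu_{L,\alpha}$ implies $H_{L,\alpha}(\eta) > 0$ for all
$\eta \ne \alpha$. Thus $L$ is $\alpha$-CC.

Conversely, now suppose $L$ is $\alpha$-CC. We claim that
$\psi_{L,\alpha}(\epsilon) > 0$ for all $\epsilon \in (0, B_\alpha]$. To
see this, suppose $\psi_{L,\alpha}(\epsilon) = 0$. Since $\nu_{L,\alpha}$
is lower semi-continuous, $\mathrm{Epi}\, \nu_{L,\alpha}$ and
$\mathrm{co}\, \mathrm{Epi}\, \nu_{L,\alpha}$ are closed sets. Therefore,
$(\epsilon, 0)$ is a convex combination of points in
$\mathrm{Epi}\, \nu_{L,\alpha}$. Since $L$ is $\alpha$-CC, we know
$\nu_{L,\alpha}(\epsilon) > 0$ for all $\epsilon \in (0, B_\alpha]$.
Therefore $\epsilon = 0$. This proves the claim.

Since $\psi_{L,\alpha}(0) = 0$ and $\psi_{L,\alpha}$ is convex and
nondecreasing, it follows that $\psi_{L,\alpha}$ is strictly increasing.
Since $\psi_{L,\alpha}$ is continuous (Lemma 1, part 5), we conclude that
$\psi_{L,\alpha}$ is invertible. □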

If $L$ is $\alpha$-CC, then
$R_\alpha(f) - R_\alpha^* \le \psi_{L,\alpha}^{-1}(R_L(f) - R_L^*)$. Since
$\psi_{L,\alpha}(0) = 0$ and $\psi_{L,\alpha}$ is nondecreasing, the same
is true of $\psi_{L,\alpha}^{-1}$. As a result, we can show that an
algorithm that is consistent for the $L$-risk is also consistent for the
$\alpha$ cost-sensitive classification risk. Such an approach was employed
by Zhang [2004] and Steinwart [2005] to prove consistency, for the
cost-insensitive risk, of different algorithms based on surrogate losses.

Corollary 1. Suppose $L$ is $\alpha$-CC.

1. If $R_L(f_i) - R_L^* \to 0$ for some sequence of decision functions
$f_i$, then $R_\alpha(f_i) - R_\alpha^* \to 0$.

2. Let $f_n$ be a classifier based on the random sample
$(X_1, Y_1), \ldots, (X_n, Y_n)$. If $R_L(f_n) - R_L^* \to 0$ in
probability, then $R_\alpha(f_n) - R_\alpha^* \to 0$ in probability. If
$R_L(f_n) - R_L^* \to 0$ with probability one, then
$R_\alpha(f_n) - R_\alpha^* \to 0$ with probability one.

Proof. Since $L$ is $\alpha$-CC, $\psi_{L,\alpha}$ is invertible. For any
$\epsilon \in (0, B_\alpha]$, if
$R_L(f) - R_L^* < \psi_{L,\alpha}(\epsilon)$, then
$R_\alpha(f) - R_\alpha^* \le \psi_{L,\alpha}^{-1}(R_L(f) - R_L^*) < \epsilon$.
Now 1 follows.

Assume $R_L(f_n) - R_L^* \to 0$ in probability. By the above reasoning, if
$R_\alpha(f) - R_\alpha^* \ge \epsilon$, then
$R_L(f) - R_L^* \ge \psi_{L,\alpha}(\epsilon)$. Therefore, for any
$\epsilon \in (0, B_\alpha]$,

$$P(R_\alpha(f_n) - R_\alpha^* \ge \epsilon) \le P(R_L(f_n) - R_L^* \ge \psi_{L,\alpha}(\epsilon)) \to 0$$

as $n \to \infty$ by assumption.

Assume $R_L(f_n) - R_L^* \to 0$ with probability one. By part 1,

$$P\Big(\lim_{n \to \infty} R_\alpha(f_n) - R_\alpha^* = 0\Big) \ge P\Big(\lim_{n \to \infty} R_L(f_n) - R_L^* = 0\Big) = 1.$$

Hence $R_\alpha(f_n) - R_\alpha^* \to 0$ with probability one. □

Below in Section 4 the above results are made more concrete when we
examine some specific losses (namely, uneven margin losses).



2.2 Cost-insensitive classification 

We turn our attention to the cost-insensitive or 0/1 loss,

$$U(y,t) := 1_{\{y=1\}} 1_{\{t \le 0\}} + 1_{\{y=-1\}} 1_{\{t > 0\}} = 2 U_{1/2}(y, t).$$

This loss is not only important in its own right, but the associated
quantity $H_L$, defined below, is useful for calculating $H_{L,\alpha}$
when $\alpha \ne \frac{1}{2}$, as explained below. The results in this
section generalize those of Bartlett et al. [2006], who focus on margin
losses. We place no restrictions on the partial losses $L_1$ and $L_{-1}$.

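For an arbitrary loss $L$, define

$$H_L(\eta) := C_L^-(\eta) - C_L^*(\eta)$$

for $\eta \in [0, 1]$, where

$$C_L^-(\eta) := \inf_{t : t(2\eta - 1) \le 0} C_L(\eta, t).$$

Also define for $\epsilon \in [0, 1]$

$$\nu_L(\epsilon) := \min_{\eta \in [0,1] : |2\eta - 1| = \epsilon} H_L(\eta) = \min\Big\{H_L\Big(\frac{1 + \epsilon}{2}\Big), H_L\Big(\frac{1 - \epsilon}{2}\Big)\Big\}.$$

Finally, define $\psi_L(\epsilon) = \nu_L^{**}(\epsilon)$ for
$\epsilon \in [0, 1]$.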

The following definition was introduced by Bartlett et al. [2006] in the
context of margin losses.

Definition 2. If $H_L(\eta) > 0$ for all $\eta \in [0, 1]$,
$\eta \ne \frac{1}{2}$, $L$ is said to be classification calibrated, and
we write $L$ is CC.

For margin losses, this coincides with the definition of Bartlett et al.
and our $H_L$ equals their $\psi$. Also note that
$H_L(\eta) = H_{L,1/2}(\eta)$, and therefore $L$ is CC iff $L$ is
$\frac{1}{2}$-CC. When $L = U$, we write $R(f)$, $R^*$, $C(\eta, t)$, and
$C^*(\eta)$ instead of $R_U(f)$, $R_U^*$, $C_U(\eta, t)$, and
$C_U^*(\eta)$, respectively.

Theorem 2. Let $L$ be a loss.

1. For any $f \in \mathcal{F}$ and any distribution $P$,

$$\psi_L(R(f) - R^*) \le R_L(f) - R_L^*.$$

2. $\psi_L$ is invertible if and only if $L$ is CC.

Proof. The proof follows from Theorem 1 and the relationships
$C(\eta, t) = 2 C_{1/2}(\eta, t)$, $C^*(\eta) = 2 C_{1/2}^*(\eta)$,
$H_L(\eta) = H_{L,1/2}(\eta)$,
$\nu_L(\epsilon) = \nu_{L,1/2}(\frac{\epsilon}{2})$, and
$\psi_L(\epsilon) = \psi_{L,1/2}(\frac{\epsilon}{2})$. Thus, to prove 1,
note

$$\begin{aligned}
\psi_L(R(f) - R^*) &= \psi_{L,1/2}\Big(\tfrac{1}{2} E_X[C(\eta(X), f(X)) - C^*(\eta(X))]\Big) \\
&= \psi_{L,1/2}(R_{1/2}(f) - R_{1/2}^*) \\
&\le R_L(f) - R_L^*.
\end{aligned}$$

To prove 2, note $\psi_L$ is invertible $\iff$ $\psi_{L,1/2}$ is
invertible $\iff$ $L$ is $\frac{1}{2}$-CC $\iff$ $L$ is CC. □

When $L$ is a margin loss, $H_L$ is symmetric with respect to
$\eta = \frac{1}{2}$, and the above result reduces to the surrogate regret
bound established by Bartlett et al. [2006].

The following extends a result for margin losses noted by Steinwart
[2007]. For any loss $L$, we can express $H_{L,\alpha}$ in terms of $H_L$.
This simplifies the determination of $H_{L,\alpha}$, $\nu_{L,\alpha}$, and
$\psi_{L,\alpha}$.

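Given the loss $L(y, t) = 1_{\{y=1\}} L_1(t) + 1_{\{y=-1\}} L_{-1}(t)$ and
$\alpha \in (0, 1)$, define

$$L_\alpha(y,t) := (1 - \alpha) 1_{\{y=1\}} L_1(t) + \alpha 1_{\{y=-1\}} L_{-1}(t).$$

Also introduce $w_\alpha(\eta) = (1 - \alpha)\eta + \alpha(1 - \eta)$ and

$$\vartheta_\alpha(\eta) = \frac{(1 - \alpha)\eta}{(1 - \alpha)\eta + \alpha(1 - \eta)}.$$

Theorem 3. For any loss $L$ and any $\alpha \in (0, 1)$,

1. For all $\eta \in [0,1]$,

$$H_{L_\alpha,\alpha}(\eta) = w_\alpha(\eta) H_L(\vartheta_\alpha(\eta)). \qquad (4)$$

2. $L$ is CC $\iff$ $L_\alpha$ is $\alpha$-CC.

3. $L$ is $\alpha$-CC $\iff$ $L_{1-\alpha}$ is CC.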

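Proof. Notice that $w_\alpha(\eta) > 0$ for all $\eta \in [0,1]$, and
$2\vartheta_\alpha(\eta) - 1 = (\eta - \alpha)/w_\alpha(\eta)$. Thus
$\mathrm{sign}(2\vartheta_\alpha(\eta) - 1) = \mathrm{sign}(\eta - \alpha)$.
In addition, $\vartheta_\alpha : [0, 1] \to [0, 1]$ is a bijection. To
prove 1, observe

$$\begin{aligned}
C_{L_\alpha}(\eta, t) &= (1 - \alpha)\eta L_1(t) + \alpha(1 - \eta) L_{-1}(t) \\
&= w_\alpha(\eta)[\vartheta_\alpha(\eta) L_1(t) + (1 - \vartheta_\alpha(\eta)) L_{-1}(t)] \\
&= w_\alpha(\eta) C_L(\vartheta_\alpha(\eta), t).
\end{aligned}$$

Therefore $C_{L_\alpha}^*(\eta) = w_\alpha(\eta) C_L^*(\vartheta_\alpha(\eta))$
and

$$\begin{aligned}
C_{L_\alpha,\alpha}^-(\eta) &= \inf_{t \in \mathbb{R} : t(\eta - \alpha) \le 0} C_{L_\alpha}(\eta, t) \\
&= w_\alpha(\eta) \inf_{t : t(2\vartheta_\alpha(\eta) - 1) \le 0} C_L(\vartheta_\alpha(\eta), t) \\
&= w_\alpha(\eta) C_L^-(\vartheta_\alpha(\eta)).
\end{aligned}$$

Therefore

$$\begin{aligned}
H_{L_\alpha,\alpha}(\eta) &= C_{L_\alpha,\alpha}^-(\eta) - C_{L_\alpha}^*(\eta) \\
&= w_\alpha(\eta)[C_L^-(\vartheta_\alpha(\eta)) - C_L^*(\vartheta_\alpha(\eta))] \\
&= w_\alpha(\eta) H_L(\vartheta_\alpha(\eta)).
\end{aligned}$$

The second statement follows from 1, the positivity of $w_\alpha$, and the
fact that $\vartheta_\alpha$ is a bijection with
$\vartheta_\alpha(\alpha) = \frac{1}{2}$.

To prove the third statement, notice
$(L_{1-\alpha})_\alpha = \alpha(1 - \alpha) L$. Therefore, $L$ is
$\alpha$-CC $\iff$ $\alpha(1 - \alpha) L$ is $\alpha$-CC $\iff$
$(L_{1-\alpha})_\alpha$ is $\alpha$-CC $\iff$ $L_{1-\alpha}$ is CC, where
the last equivalence follows from 2. □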
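
The key algebraic step in this proof, $C_{L_\alpha}(\eta, t) = w_\alpha(\eta) C_L(\vartheta_\alpha(\eta), t)$, can be checked numerically. The sketch below does so for one illustrative choice of partial losses (the exponential loss, chosen here only as an assumption for the example); all function names are hypothetical.

```python
import math


def w(alpha, eta):
    """w_alpha(eta) = (1 - alpha) * eta + alpha * (1 - eta)."""
    return (1 - alpha) * eta + alpha * (1 - eta)


def theta(alpha, eta):
    """vartheta_alpha(eta), the reweighted class probability."""
    return (1 - alpha) * eta / w(alpha, eta)


# Illustrative partial losses (an assumption for this sketch): exponential loss.
def L_pos(t):
    return math.exp(-t)


def L_neg(t):
    return math.exp(t)


def cond_risk(eta, t):
    """C_L(eta, t) for the unweighted loss L."""
    return eta * L_pos(t) + (1 - eta) * L_neg(t)


def cond_risk_weighted(alpha, eta, t):
    """C_{L_alpha}(eta, t) for the alpha-weighted loss L_alpha."""
    return (1 - alpha) * eta * L_pos(t) + alpha * (1 - eta) * L_neg(t)


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 0.7):
        # vartheta_alpha maps the threshold alpha to 1/2.
        assert math.isclose(theta(alpha, alpha), 0.5)
        for eta in (0.0, 0.1, 0.5, 0.9, 1.0):
            for t in (-1.5, 0.0, 0.8):
                lhs = cond_risk_weighted(alpha, eta, t)
                rhs = w(alpha, eta) * cond_risk(theta(alpha, eta), t)
                assert math.isclose(lhs, rhs, rel_tol=1e-12, abs_tol=1e-12)
```

Because the identity holds for every $t$, taking infima over the same set of $t$ on both sides gives the relations for $C^*$ and $C^-$ used in the proof.
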



2.3 Convex partial losses 



When the partial losses $L_1$ and $L_{-1}$ are convex, we can deduce some
convenient characterizations of $\alpha$-CC losses.

Theorem 4. Let $L$ be a loss and $\alpha \in (0, 1)$. Assume $L_1$ and
$L_{-1}$ are convex and differentiable at 0. Then $L$ is $\alpha$-CC if
and only if

$$L_1'(0) < 0, \quad L_{-1}'(0) > 0, \quad \text{and} \quad \alpha L_1'(0) + (1 - \alpha) L_{-1}'(0) = 0. \qquad (5)$$

A similar result appears in Reid and Williamson [2009b], and when the
loss is a composite proper loss the results are equivalent. Their result
is expressed in the context of class probability estimation, while our
result is tailored directly to classification. Although the proofs are
essentially the same, our setting allows us to state the result without
assuming the loss is differentiable everywhere. Thus, it encompasses
losses that are not suitable for class probability estimation, such as the
uneven hinge loss described below. We also make an observation in the
special case where $\alpha = \frac{1}{2}$ and $L$ is a margin loss, also
noted by Reid and Williamson [2009b]. Then $L_1'(0) = \phi'(0)$ and
$L_{-1}'(0) = -\phi'(0)$, and (5) is equivalent to $\phi'(0) < 0$, the
condition identified by Bartlett et al. [2006].



Proof. Note that
$\frac{\partial}{\partial t} C_L(\eta, 0) = \eta L_1'(0) + (1 - \eta) L_{-1}'(0)$.
Now $L$ is $\alpha$-CC if and only if
$C_{L,\alpha}^-(\eta) > C_L^*(\eta)$ for all $\eta \in [0, 1]$,
$\eta \ne \alpha$, and by convexity of $L_1$ and $L_{-1}$, the latter
condition holds if and only if

$$\eta L_1'(0) + (1 - \eta) L_{-1}'(0) \; \begin{cases} < 0 & \text{if } \eta > \alpha \\ > 0 & \text{if } \eta < \alpha. \end{cases} \qquad (6)$$

Thus, we must show (5) $\iff$ (6). Assume (6) holds. Since
$\eta \mapsto \eta L_1'(0) + (1 - \eta) L_{-1}'(0)$ is continuous, we must
have $\alpha L_1'(0) + (1 - \alpha) L_{-1}'(0) = 0$. $L_1'(0) < 0$ follows
from (6) with $\eta = 1$, and $L_{-1}'(0) > 0$ follows from (6) with
$\eta = 0$.

Now suppose (5) holds. Then
$\eta \mapsto \eta L_1'(0) + (1 - \eta) L_{-1}'(0)$ is an affine function
with negative slope that outputs 0 when $\eta = \alpha$. Thus (6)
holds. □
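
Condition (5) only involves the two derivatives at zero, so it is easy to test mechanically. The following sketch (with a hypothetical helper name) checks it for the exponential margin loss $\phi(t) = e^{-t}$ and for its $\alpha$-weighted version. It assumes the caller has already verified convexity and differentiability at 0, which Theorem 4 requires but this code does not check.

```python
def satisfies_condition_5(dL_pos, dL_neg, alpha, tol=1e-9):
    """Check condition (5) given the derivatives L_1'(0) and L_{-1}'(0).

    Convexity of the partial losses is assumed, not verified.
    """
    balanced = abs(alpha * dL_pos + (1 - alpha) * dL_neg) <= tol
    return dL_pos < 0 and dL_neg > 0 and balanced


if __name__ == "__main__":
    # Exponential margin loss: L_1(t) = exp(-t), L_{-1}(t) = exp(t),
    # so L_1'(0) = -1 and L_{-1}'(0) = 1.
    print(satisfies_condition_5(-1.0, 1.0, alpha=0.5))  # balanced only at 1/2
    print(satisfies_condition_5(-1.0, 1.0, alpha=0.3))

    # alpha-weighted version: L_1 = (1 - alpha) exp(-t), L_{-1} = alpha exp(t),
    # so the derivatives at 0 are -(1 - alpha) and alpha.
    alpha = 0.3
    print(satisfies_condition_5(-(1 - alpha), alpha, alpha))
```

The weighted case illustrates Theorem 3, part 2: reweighting a CC loss by $(1 - \alpha, \alpha)$ restores the balance in (5) at the desired $\alpha$.
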
The following result facilitates calculation of regret bounds.

Theorem 5. Assume $L_1$ and $L_{-1}$ are convex.

1. If $L$ is $\alpha$-CC, then
$C_{L,\alpha}^-(\eta) = \eta L_1(0) + (1 - \eta) L_{-1}(0)$ and
$H_{L,\alpha}$ is convex.

2. If $L$ is CC, then $C_L^-(\eta) = \eta L_1(0) + (1 - \eta) L_{-1}(0)$,
and $H_L$ is convex.

Proof. The formulas for $C_{L,\alpha}^-$ and $C_L^-$ follow from
definitions and convexity of $L_1$ and $L_{-1}$.
$H_{L,\alpha}(\eta) = C_{L,\alpha}^-(\eta) - C_L^*(\eta)$ is convex because
$C_{L,\alpha}^-$ is affine and $C_L^*$ is concave (Lemma 1, part 2).
Therefore $H_L = H_{L,1/2}$ is also convex. □

3 Calibration Functions 

In this section we present an alternative, though ultimately equivalent,
approach to surrogate regret bounds. Additional properties of
$\alpha$-CC losses are derived, and connections to [Steinwart, 2007] are
established. We begin with an alternate definition of
$\alpha$-classification calibrated.

Definition 3. We say $L$ is $\alpha$-CC$'$ if, for all $\epsilon > 0$,
$\eta \in [0,1]$, there exists $\delta > 0$ such that

$$C_L(\eta, t) - C_L^*(\eta) < \delta \implies C_\alpha(\eta, t) - C_\alpha^*(\eta) < \epsilon. \qquad (7)$$

We say $L$ is uniformly $\alpha$-CC$'$ if, for all $\epsilon > 0$, there
exists $\delta > 0$ such that

$$\forall \eta \in [0,1], \quad C_L(\eta, t) - C_L^*(\eta) < \delta \implies C_\alpha(\eta, t) - C_\alpha^*(\eta) < \epsilon. \qquad (8)$$

Recall $B_\alpha = \max(\alpha, 1 - \alpha)$. For
$\epsilon \in [0, B_\alpha]$ also define

$$\mu_{L,\alpha}(\epsilon) := \inf_{\eta \in [0,1] : |\eta - \alpha| \ge \epsilon} H_{L,\alpha}(\eta) = \inf_{\epsilon \le \epsilon' \le B_\alpha} \nu_{L,\alpha}(\epsilon').$$
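
On a grid of $\epsilon$ values, the second expression for $\mu_{L,\alpha}$ is a running infimum of $\nu_{L,\alpha}$ taken from the right. A minimal sketch, assuming the values of $\nu_{L,\alpha}$ have already been sampled at increasing grid points (the function name is hypothetical):

```python
def mu_from_nu(nu_values):
    """Running infimum from the right: mu[i] = min(nu[i], nu[i+1], ...).

    nu_values holds samples of nu at increasing values of epsilon.
    """
    mu = list(nu_values)
    for i in range(len(mu) - 2, -1, -1):
        mu[i] = min(mu[i], mu[i + 1])
    return mu


if __name__ == "__main__":
    # A non-monotone nu: the dip at the third grid point pulls mu down
    # at the second grid point as well.
    print(mu_from_nu([0.0, 0.3, 0.1, 0.5]))
```

By construction the result is nondecreasing and never exceeds the input, which matches $\mu_{L,\alpha} \le \nu_{L,\alpha}$.
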

Theorem 6. Let $\alpha \in (0, 1)$. For any loss $L$,

1. For all $\epsilon > 0$, $\eta \in [0, 1]$,

$$C_L(\eta, t) - C_L^*(\eta) < H_{L,\alpha}(\eta) \implies C_\alpha(\eta, t) - C_\alpha^*(\eta) < \epsilon.$$

2. For all $\epsilon > 0$, $\eta \in [0,1]$,

$$C_L(\eta, t) - C_L^*(\eta) < \mu_{L,\alpha}(\epsilon) \implies C_\alpha(\eta, t) - C_\alpha^*(\eta) < \epsilon.$$

If $L$ is $\alpha$-CC, then

3. $L$ is $\alpha$-CC$'$.

4. $L$ is uniformly $\alpha$-CC$'$.

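Proof. To prove 1, let $\epsilon > 0$, $\eta \in [0, 1]$. In Lemma 1,
part 1 it is shown that
$C_\alpha(\eta, t) - C_\alpha^*(\eta) = 1_{\{\mathrm{sign}(t) \ne \mathrm{sign}(\eta - \alpha)\}} |\eta - \alpha|$.
Thus, if $\epsilon > |\eta - \alpha|$, the result follows. Suppose
$\epsilon \le |\eta - \alpha|$. Then
$C_\alpha(\eta, t) - C_\alpha^*(\eta) \ge \epsilon \iff \mathrm{sign}(t) \ne \mathrm{sign}(\eta - \alpha)$,
and

$$\begin{aligned}
H_{L,\alpha}(\eta) &= \inf_{t \in \mathbb{R} : t(\eta - \alpha) \le 0} C_L(\eta, t) - C_L^*(\eta) \\
&\le \inf_{t : \mathrm{sign}(t) \ne \mathrm{sign}(\eta - \alpha)} C_L(\eta, t) - C_L^*(\eta) \\
&= \inf_{t : C_\alpha(\eta, t) - C_\alpha^*(\eta) \ge \epsilon} C_L(\eta, t) - C_L^*(\eta).
\end{aligned}$$

Therefore, if $C_L(\eta, t) - C_L^*(\eta) < H_{L,\alpha}(\eta)$, then
$C_\alpha(\eta, t) - C_\alpha^*(\eta) < \epsilon$.

To prove 2, let $\epsilon > 0$, $\eta \in [0, 1]$. If
$\epsilon > |\eta - \alpha|$, then as in part 1 the result follows
immediately. If $\epsilon \le |\eta - \alpha|$, then
$\mu_{L,\alpha}(\epsilon) \le H_{L,\alpha}(\eta)$ and the result follows
from part 1.

Since uniformly $\alpha$-CC$'$ implies $\alpha$-CC$'$, 3 follows from 4.
To show 4, let $\epsilon > 0$. By Lemma 1, part 3, $H_{L,\alpha}$ is
continuous on $\{\eta \in [0, 1] : |\eta - \alpha| \ge \epsilon\}$. Thus
for $\epsilon \le B_\alpha$, $\mu_{L,\alpha}(\epsilon)$ is the infimum of a
continuous, positive function on a compact set and therefore positive.
Taking $\delta = \mu_{L,\alpha}(\epsilon)$, the result follows by part 2.
If $\epsilon > B_\alpha$, the result holds because
$C_\alpha(\eta, t) - C_\alpha^*(\eta) = 1_{\{\mathrm{sign}(t) \ne \mathrm{sign}(\eta - \alpha)\}} |\eta - \alpha| \in [0, B_\alpha]$. □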



Steinwart [2007] employs $\alpha$-CC$'$ as the definition of
classification calibrated in the case of cost-sensitive classification.
Although $\alpha$-CC implies $\alpha$-CC$'$, the reverse implication is
not true as the counterexample $L = U_\alpha$ demonstrates (perhaps
ironically). Under a mild assumption on the partial losses, Steinwart's
definitions and ours agree. This is part 1 of the following result. Under
this same mild assumption, we can also express what Steinwart calls the
calibration function and uniform calibration function of $L$. These are
the quantities $\delta(\epsilon, \eta)$ and $\delta(\epsilon)$ in parts 2
and 3, respectively.
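Theorem 7. Assume $L_1$ and $L_{-1}$ are continuous at 0.

1. The following are equivalent:

(a) $L$ is $\alpha$-CC

(b) $L$ is $\alpha$-CC$'$

(c) $L$ is uniformly $\alpha$-CC$'$

2. For any $\epsilon > 0$ and $\eta \in [0, 1]$, the largest $\delta$ such
that (7) holds is

$$\delta(\epsilon, \eta) := \begin{cases} \infty, & \epsilon > |\eta - \alpha| \\ H_{L,\alpha}(\eta), & \epsilon \le |\eta - \alpha|. \end{cases} \qquad (9)$$

3. For any $\epsilon > 0$, the largest $\delta$ such that (8) holds is

$$\delta(\epsilon) := \begin{cases} \infty, & \epsilon > B_\alpha \\ \mu_{L,\alpha}(\epsilon), & \epsilon \le B_\alpha. \end{cases} \qquad (10)$$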

Proof. We have already shown (a) implies (b) and (c), and (c) implies (b)
is obvious, so let us show (b) implies (a).

If $\epsilon > 0$ and $\eta \in [0, 1]$ are such that
$\epsilon \le |\eta - \alpha|$, then $\eta \ne \alpha$, and under the
continuity assumption we have

$$\inf_{t \in \mathbb{R} : t(\eta - \alpha) \le 0} C_L(\eta, t) = \inf_{t : \mathrm{sign}(t) \ne \mathrm{sign}(\eta - \alpha)} C_L(\eta, t).$$

Therefore, from the proof of Theorem 6, part 1,

$$H_{L,\alpha}(\eta) = \inf_{t : C_\alpha(\eta, t) - C_\alpha^*(\eta) \ge \epsilon} C_L(\eta, t) - C_L^*(\eta). \qquad (11)$$

Now assume (b) holds, and let $\eta \in [0, 1]$, $\eta \ne \alpha$. Set
$\epsilon = |\eta - \alpha|$. Since $L$ is $\alpha$-CC$'$, the right hand
side of (11) is positive. Therefore $H_{L,\alpha}(\eta) > 0$ which
establishes (a).

Now consider part 2. If $\epsilon > |\eta - \alpha|$, then
$C_\alpha(\eta, t) - C_\alpha^*(\eta) = 1_{\{\mathrm{sign}(t) \ne \mathrm{sign}(\eta - \alpha)\}} |\eta - \alpha| < \epsilon$
regardless of $\delta$. If $\epsilon \le |\eta - \alpha|$, then (11) holds
which establishes the result in this case.

To prove 3, first consider $\epsilon > B_\alpha$. Then
$C_\alpha(\eta, t) - C_\alpha^*(\eta) \le B_\alpha < \epsilon$ regardless
of $\delta$. Now suppose $\epsilon \le B_\alpha$. Then
$\{\eta \in [0,1] : |\eta - \alpha| \ge \epsilon\}$ is nonempty, and this
case now follows from part 2 and the definition of $\mu_{L,\alpha}$. □


An emphasis of Steinwart [2007] is the relationship between surrogate
regret bounds and uniform calibration functions. In our setting, Theorem 6,
part 2 directly implies a surrogate regret bound in terms of
$\mu_{L,\alpha}$.

Theorem 8. Let $L$ be a loss, $\alpha \in (0, 1)$. Then

$$\mu_{L,\alpha}^{**}(R_\alpha(f) - R_\alpha^*) \le R_L(f) - R_L^*.$$

This result is similar to Theorem 2.13 of Steinwart [2007] and
surrounding discussion. While that result holds in a very general setting
that spans many learning problems, Theorem 8 specializes the underlying
principle to cost-sensitive classification.

Proof. By Theorem 6, part 2, we know that
$C_L(\eta, t) - C_L^*(\eta) < \mu_{L,\alpha}(\epsilon) \implies C_\alpha(\eta, t) - C_\alpha^*(\eta) < \epsilon$.
Given $f \in \mathcal{F}$ and $x \in \mathcal{X}$, let
$\epsilon = C_\alpha(\eta(x), f(x)) - C_\alpha^*(\eta(x))$. Then
$C_L(\eta(x), f(x)) - C_L^*(\eta(x)) \ge \mu_{L,\alpha}(\epsilon)$, or in
other words

$$\mu_{L,\alpha}(C_\alpha(\eta(x), f(x)) - C_\alpha^*(\eta(x))) \le C_L(\eta(x), f(x)) - C_L^*(\eta(x)).$$

By Jensen's inequality,

$$\begin{aligned}
\mu_{L,\alpha}^{**}(R_\alpha(f) - R_\alpha^*) &\le E_X[\mu_{L,\alpha}^{**}(C_\alpha(\eta(X), f(X)) - C_\alpha^*(\eta(X)))] \\
&\le E_X[\mu_{L,\alpha}(C_\alpha(\eta(X), f(X)) - C_\alpha^*(\eta(X)))] \\
&\le E_X[C_L(\eta(X), f(X)) - C_L^*(\eta(X))] \\
&= R_L(f) - R_L^*.
\end{aligned}$$

□

Thus, for any loss we have two surrogate regret bounds. In fact, the two
bounds are the same.

Theorem 9. Let $\alpha \in (0, 1)$.

1. For any loss $L$, $\mu_{L,\alpha}^{**} = \nu_{L,\alpha}^{**}$.

2. If $L_1$ and $L_{-1}$ are convex, then
$\mu_{L,\alpha} = \nu_{L,\alpha}$.

Proof. Part 1 follows from Lemma 2. To see the second statement, recall
that $H_{L,\alpha}$ is nonnegative, $H_{L,\alpha}(\alpha) = 0$ (Lemma 1,
part 4), and $H_{L,\alpha}$ is convex (Theorem 5). Thus
$H_{L,\alpha}(\eta)$ is nondecreasing as $|\eta - \alpha|$ grows, and the
result follows. □

Thus $\nu_{L,\alpha}$ and $\mu_{L,\alpha}$ give two approaches to the same
bound. $\nu_{L,\alpha}$ is perhaps simpler to conceptualize, and
$\mu_{L,\alpha}$ is connected to the notion of uniform calibration.

4 Uneven Margin Losses 

We now apply the preceding theory to a special class of asymmetric losses.

Definition 4. Let $\phi : \mathbb{R} \to [0, \infty)$ and
$\beta, \gamma > 0$. We refer to the losses

$$L(y,t) = 1_{\{y=1\}} \phi(t) + 1_{\{y=-1\}} \beta \phi(-\gamma t)$$

and

$$L_\alpha(y,t) = (1 - \alpha) 1_{\{y=1\}} \phi(t) + \alpha 1_{\{y=-1\}} \beta \phi(-\gamma t)$$

as uneven margin losses.

When $\beta = \gamma = 1$, $L$ in Definition 4 is a conventional margin
loss, and $L_\alpha$ can be called an $\alpha$-weighted margin loss. Since
they differ from margin losses by a couple of scalar parameters, empirical
risks based on uneven margin losses can typically be optimized by slightly
modified versions of margin-based algorithms.

Before proceeding, we offer a couple of comments on Definition 4. First,
although $\beta$ may appear redundant in $L_\alpha$, it is not. $\alpha$
is fixed at a desired cost parameter, and thus is not tunable. Second,
there would be no added benefit from a loss of the form
$1_{\{y=1\}} \phi(\gamma' t) + 1_{\{y=-1\}} \beta \phi(-\gamma t)$. We may
assume $\gamma' = 1$ without loss of generality since scaling a decision
function $f$ by a positive constant does not alter the induced classifier.
However, alternate parametrizations such as
$1_{\{y=1\}} \phi((1 - \rho)t) + 1_{\{y=-1\}} \beta \phi(-\rho t)$,
$\rho \in (0,1)$, might be desirable in some situations.

A common motivation for uneven margin losses is classification with
an unbalanced training data set. In unbalanced data, one class has (often
substantially) more representation than the other, and margin losses have
been observed to perform poorly in such situations. Weighted margin
losses, which have the form
$\alpha' 1_{\{y=1\}} \phi(t) + (1 - \alpha') 1_{\{y=-1\}} \phi(-t)$, are
often used as a heuristic for unbalanced data. However, as Steinwart
[2007] notes, there is no reason why the $\alpha'$ that yields good
performance on unbalanced data will be the desired cost parameter
$\alpha$. In other words, this heuristic typically results in losses that
are not $\alpha$-CC.

The parameter $\gamma$ offers another means to accommodate unbalanced
data. Such losses have previously been explored in the context of
specific algorithms, including the perceptron [Li et al., 2002], boosting
[Masnadi-Shirazi and Vasconcelos, 2007], and support vector machines
[Yang et al., 2009, Li and Shawe-Taylor, 2003]. Uneven margins
($\gamma \ne 1$) have been found to yield improved empirical performance
in classification problems involving label-dependent costs and/or
unbalanced data.

Prior work involving uneven margin losses has not addressed the issue
of whether these losses are CC or $\alpha$-CC. The following result
clarifies the issue for convex $\phi$.

Corollary 2. Let $\phi$ be convex and differentiable at 0,
$\beta, \gamma > 0$, and let $L$, $L_\alpha$ be the associated uneven
margin losses as in Definition 4. The following are equivalent:

(a) $L$ is CC

(b) $L_\alpha$ is $\alpha$-CC

(c) $\beta = \frac{1}{\gamma}$ and $\phi'(0) < 0$.

Proof. The equivalence of (a) and (b) follows from Theorem 3, and the
equivalence of (b) and (c) follows from Theorem 4. □

This result implies that for any $\alpha \in (0, 1)$ and $\gamma > 0$,

$$L_\alpha(y,t) = (1 - \alpha) 1_{\{y=1\}} \phi(t) + \frac{\alpha}{\gamma} 1_{\{y=-1\}} \phi(-\gamma t)$$

is $\alpha$-CC provided $\phi$ is convex and $\phi'(0) < 0$. Thus,
$\gamma$ is a parameter that can be tuned as needed, such as for
unbalanced data, while the loss remains $\alpha$-CC. Figure 1 displays the
partial losses for three common $\phi$ and for three values of $\gamma$.
If $\phi$ is not convex, then uneven margin losses can still be
$\alpha$-CC, but the necessary relationship between $\beta$ and $\gamma$
may be different from that given by Corollary 2. An example is given below
where $\phi$ is a sigmoid.

To illustrate the general theory developed in Sec. 2, four examples of
uneven margin losses, corresponding to different $\phi$, are now
considered in detail. The first three are convex, while the fourth is not.
In each case, the primary effort goes into computing
$H_L(\eta) = C_L^-(\eta) - C_L^*(\eta)$. Given $H_L$,
$H_{L_\alpha,\alpha}$ is determined by Eqn. (4), and
$\nu_{L_\alpha,\alpha}$ by Eqns. (2) and (3). For the convex $\phi$, all
of which satisfy $\phi(0) = 1$,
$C_L^-(\eta) = \eta + \frac{1}{\gamma}(1 - \eta)$ by Theorem 5, part 2.



4.1 Uneven hinge loss 

Let $\phi(t) = (1 - t)_+$, where $(s)_+ = \max(0, s)$. Then

$$L(y,t) = 1_{\{y=1\}} (1 - t)_+ + 1_{\{y=-1\}} \frac{1}{\gamma}(1 + \gamma t)_+$$

and

$$C_L(\eta, t) = \eta(1 - t)_+ + \frac{1 - \eta}{\gamma}(1 + \gamma t)_+ = \begin{cases} \eta(1 - t), & t \le -\frac{1}{\gamma} \\ \eta(1 - t) + \frac{1 - \eta}{\gamma}(1 + \gamma t), & -\frac{1}{\gamma} < t < 1 \\ \frac{1 - \eta}{\gamma}(1 + \gamma t), & t \ge 1. \end{cases}$$

[Figure 1 panels: $L_{-1}(t)$ for $\gamma = 1/3$, $\gamma = 1$, and
$\gamma = 3$; curves for the hinge, squared error, and exponential
losses.]

Figure 1: Partial losses of an uneven margin loss, for three common
$\phi$ (hinge, squared error, and exponential) and three values of
$\gamma$.



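Since $C_L$ is piecewise linear and continuous, we know $C_L^*(\eta)$ is
the value of $C_L(\eta, t)$ when $t$ is one of the two knot locations.
Thus

$$C_L^*(\eta) = \min\Big(\eta\Big(1 + \frac{1}{\gamma}\Big), \frac{1 - \eta}{\gamma}(1 + \gamma)\Big) = \frac{1 + \gamma}{\gamma} \min(\eta, 1 - \eta)$$

and

$$H_L(\eta) = \eta + \frac{1}{\gamma}(1 - \eta) - \frac{1 + \gamma}{\gamma} \min(\eta, 1 - \eta) = \begin{cases} 2\eta - 1, & \eta \ge \frac{1}{2} \\ \frac{1 - 2\eta}{\gamma}, & \eta < \frac{1}{2}. \end{cases}$$

Now $H_{L_\alpha,\alpha}(\eta)$ is given by Eqn. (4), and
$\nu_{L_\alpha,\alpha}$ is given by Eqns. (2) and (3). For the hinge case
these expressions simplify considerably:

$$H_{L_\alpha,\alpha}(\eta) = \begin{cases} \eta - \alpha, & \eta \ge \alpha \\ \frac{\alpha - \eta}{\gamma}, & \eta < \alpha. \end{cases}$$

Expressions for $\nu_{L_\alpha,\alpha}$ are given below. Figure 2 shows
$H_{L_\alpha,\alpha}$ and $\nu_{L_\alpha,\alpha}$ for three values of
$\alpha$ and four values of $\gamma$.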

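These plots illustrate how $\nu_{L_\alpha,\alpha}$ is sometimes
discontinuous at $\min(\alpha, 1 - \alpha)$. We can characterize when
$\nu_{L_\alpha,\alpha}$ has a discontinuity as follows. From Eqn. (2), for
$\alpha \le \frac{1}{2}$,

$$\nu_{L_\alpha,\alpha}(\epsilon) = \begin{cases} \min(\epsilon, \frac{\epsilon}{\gamma}), & 0 \le \epsilon \le \alpha \\ \epsilon, & \alpha < \epsilon \le 1 - \alpha. \end{cases}$$

This is discontinuous at $\alpha$ iff $\gamma > 1$. By Eqn. (3), for
$\alpha > \frac{1}{2}$,

$$\nu_{L_\alpha,\alpha}(\epsilon) = \begin{cases} \min(\epsilon, \frac{\epsilon}{\gamma}), & 0 \le \epsilon \le 1 - \alpha \\ \frac{\epsilon}{\gamma}, & 1 - \alpha < \epsilon \le \alpha. \end{cases}$$

This is discontinuous at $1 - \alpha$ iff $\gamma < 1$. If
$\alpha = \frac{1}{2}$, $\nu_{L_\alpha,\alpha}$ is never discontinuous. In
summary, $\nu_{L_\alpha,\alpha}$ is discontinuous at
$\min(\alpha, 1 - \alpha)$ iff $(\alpha - \frac{1}{2})(\gamma - 1) < 0$.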
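
The hinge formulas are simple enough to evaluate directly. The sketch below, with hypothetical function names, computes $H_{L_\alpha,\alpha}$ and $\nu_{L_\alpha,\alpha}$ for the uneven hinge loss with $\beta = 1/\gamma$ and illustrates the jump of $\nu_{L_\alpha,\alpha}$ at $\epsilon = \alpha$ when $\alpha < \frac{1}{2}$ and $\gamma > 1$.

```python
def H_hinge(alpha, gamma, eta):
    """H_{L_alpha, alpha}(eta) for the uneven hinge loss with beta = 1/gamma."""
    return eta - alpha if eta >= alpha else (alpha - eta) / gamma


def nu_hinge(alpha, gamma, eps):
    """nu_{L_alpha, alpha}(eps): min of H over eta in [0, 1], |eta - alpha| = eps."""
    candidates = [eta for eta in (alpha - eps, alpha + eps) if 0.0 <= eta <= 1.0]
    return min(H_hinge(alpha, gamma, eta) for eta in candidates)


def has_jump(alpha, gamma):
    """Discontinuity criterion stated in the text."""
    return (alpha - 0.5) * (gamma - 1.0) < 0


if __name__ == "__main__":
    alpha, gamma = 0.3, 2.0
    # At eps = alpha both eta = 0 and eta = 2 * alpha are admissible, and the
    # smaller value eps / gamma wins; just above, only eta = alpha + eps remains.
    print(nu_hinge(alpha, gamma, 0.30), nu_hinge(alpha, gamma, 0.31))
    print(has_jump(alpha, gamma), has_jump(alpha, 0.5))
```

With $\gamma = 0.5$ and the same $\alpha$, the minimum at $\epsilon = \alpha$ is attained on the branch $\eta = \alpha + \epsilon$, so no jump occurs, in agreement with the criterion.
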

4.2 Uneven squared error loss 

Now let $\phi(t) = (1-t)^2$. Then

$$L(y,t) = 1_{\{y=1\}}(1-t)^2 + 1_{\{y=-1\}}\frac{1}{\gamma}(1+\gamma t)^2$$

and

$$C_L(\eta,t) = \eta(1-t)^2 + \frac{1-\eta}{\gamma}(1+\gamma t)^2.$$

The minimizer of $C_L(\eta,t)$ is

$$t^* = \frac{2\eta - 1}{\eta + \gamma(1-\eta)}.$$

This yields (after some algebra)

$$C_L^*(\eta) = C_L(\eta,t^*) = \frac{(1+\gamma)^2}{\gamma}\cdot\frac{\eta(1-\eta)}{\eta + \gamma(1-\eta)}$$

and therefore

$$H_L(\eta) = \eta + \frac{1}{\gamma}(1-\eta) - \frac{(1+\gamma)^2}{\gamma}\cdot\frac{\eta(1-\eta)}{\eta + \gamma(1-\eta)}.$$

Figure 2: Uneven hinge loss. $H_{L_\alpha,\alpha}$ (left column) and $\nu_{L_\alpha,\alpha}$ (right column) for three values of $\alpha$ and four values of $\gamma$.

Figure 3 shows plots of $H_{L_\alpha,\alpha}$ and $\nu_{L_\alpha,\alpha}$ for various values of $\alpha$ and $\gamma$. We again see evidence that $\nu_{L_\alpha,\alpha}$ can be discontinuous at $\min(\alpha, 1-\alpha)$.

As in the other example, we have not indicated $\psi_{L_\alpha,\alpha}$. Yet it can easily be visualized as the largest convex minorant of $\nu_{L_\alpha,\alpha}$. In many cases, $\nu_{L_\alpha,\alpha}$ is actually convex and hence equals $\psi_{L_\alpha,\alpha}$. The same comment applies to the hinge and exponential examples.
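The "some algebra" behind the uneven squared error formulas can be spot-checked numerically. The sketch below uses hypothetical function names and evaluates the minimizer $t^*$, the optimal conditional risk $C_L^*$, and $H_L(\eta) = C_L(\eta,0) - C_L^*(\eta)$. A short calculation (not stated in the text, so treat it as an observation rather than a quoted result) suggests $H_L(\eta)$ also equals $(2\eta-1)^2 / (\eta + \gamma(1-\eta))$.

```python
def sq_cond_risk(eta, t, gamma):
    """C_L(eta, t) for the uneven squared error loss."""
    return eta * (1.0 - t) ** 2 + (1.0 - eta) / gamma * (1.0 + gamma * t) ** 2


def sq_minimizer(eta, gamma):
    """t* = (2*eta - 1) / (eta + gamma * (1 - eta))."""
    return (2.0 * eta - 1.0) / (eta + gamma * (1.0 - eta))


def sq_opt_risk(eta, gamma):
    """C_L^*(eta) = ((1 + gamma)^2 / gamma) * eta * (1 - eta) / (eta + gamma * (1 - eta))."""
    return (1.0 + gamma) ** 2 / gamma * eta * (1.0 - eta) / (eta + gamma * (1.0 - eta))


def sq_H(eta, gamma):
    """H_L(eta) = C_L(eta, 0) - C_L^*(eta)."""
    return sq_cond_risk(eta, 0.0, gamma) - sq_opt_risk(eta, gamma)
```

Evaluating `sq_cond_risk(eta, sq_minimizer(eta, gamma), gamma)` and comparing it with `sq_opt_risk(eta, gamma)` is a direct check of the closed form.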

4.3 Uneven exponential loss 

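Now let $\phi(t) = e^{-t}$ and consider

$$L(y,t) = 1_{\{y=1\}}e^{-t} + 1_{\{y=-1\}}\frac{1}{\gamma}e^{\gamma t}.$$

Then

$$C_L(\eta,t) = \eta e^{-t} + \frac{1-\eta}{\gamma}e^{\gamma t}$$

is minimized by

$$t^* = \frac{1}{1+\gamma}\ln\left(\frac{\eta}{1-\eta}\right),$$

yielding

$$C_L^*(\eta) = C_L(\eta,t^*) = \eta\left(\frac{1-\eta}{\eta}\right)^{\frac{1}{1+\gamma}} + \frac{1-\eta}{\gamma}\left(\frac{\eta}{1-\eta}\right)^{\frac{\gamma}{1+\gamma}}.$$

Figure 4 shows plots of $H_{L_\alpha,\alpha}$ and $\nu_{L_\alpha,\alpha}$ for various $\alpha$ and $\gamma$.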
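As with the other losses, the uneven exponential formulas can be verified numerically. The sketch below uses hypothetical function names and is restricted to $0 < \eta < 1$, where the logarithm is defined. The equivalent compact form $(1 + 1/\gamma)\,\eta^{\gamma/(1+\gamma)}(1-\eta)^{1/(1+\gamma)}$ used in the check is a rearrangement of the two-term expression, not a formula quoted from the text.

```python
import math


def exp_cond_risk(eta, t, gamma):
    """C_L(eta, t) for the uneven exponential loss."""
    return eta * math.exp(-t) + (1.0 - eta) / gamma * math.exp(gamma * t)


def exp_minimizer(eta, gamma):
    """t* = ln(eta / (1 - eta)) / (1 + gamma), valid for 0 < eta < 1."""
    return math.log(eta / (1.0 - eta)) / (1.0 + gamma)


def exp_opt_risk(eta, gamma):
    """Two-term closed form of C_L^*(eta), valid for 0 < eta < 1."""
    a = 1.0 / (1.0 + gamma)
    return (eta * ((1.0 - eta) / eta) ** a
            + (1.0 - eta) / gamma * (eta / (1.0 - eta)) ** (gamma * a))


def exp_opt_risk_compact(eta, gamma):
    """Rearranged form: (1 + 1/gamma) * eta^(gamma/(1+gamma)) * (1-eta)^(1/(1+gamma))."""
    a = 1.0 / (1.0 + gamma)
    return (1.0 + 1.0 / gamma) * eta ** (gamma * a) * (1.0 - eta) ** a
```

With $\gamma = 1$ the compact form reduces to $2\sqrt{\eta(1-\eta)}$, the familiar optimal conditional risk of the ordinary exponential loss.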
4.4 Uneven sigmoid loss 

Finally we consider a nonconvex $\phi$, namely the sigmoid function $\phi(t) = 1/(1+e^t)$. For concreteness, we fix $\gamma = 2$ and study

$$L(y,t) = 1_{\{y=1\}}\frac{1}{1+e^{t}} + 1_{\{y=-1\}}\frac{1}{2}\cdot\frac{1}{1+e^{-2t}}.$$

General $\gamma$ will be discussed at the end.

Since $\phi$ is not convex, we cannot conclude $L$ is CC. In fact, we will show that $L$ is $\alpha$-CC for $\alpha = (3 + 4\sqrt{2})/23 \approx 0.37639$.

Figure 4: Uneven exponential loss. $H_{L_\alpha,\alpha}$ (left column) and $\nu_{L_\alpha,\alpha}$ (right column) for three values of $\alpha$ and four values of $\gamma$.

Figure 5 shows

$$C_L(\eta,t) = \eta\frac{1}{1+e^{t}} + \frac{1-\eta}{2}\cdot\frac{1}{1+e^{-2t}}$$

as a function of $t$, for six different $\eta$. These graphs are useful in understanding $C^-_{L,\alpha}(\eta)$ and $C_L^*(\eta)$. When $\eta < \frac{1}{2}$, it can be shown that $C_L(\eta,t)$ has a single local minimum and a single local maximum. When $\eta \ge \frac{1}{2}$, on the other hand, $C_L(\eta,t)$ is strictly decreasing. Let $t_-(\eta)$ denote the local minimizer when $\eta < \frac{1}{2}$. This function can be expressed in closed form. See Appendix B for these and other details.

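First, we determine $C_L^*$. The infimum of $C_L(\eta,t)$ over $t \in \mathbb{R}$ is either $C_L(\eta,t_-(\eta))$ or $C_L(\eta,\infty) = (1-\eta)/2$. As indicated by Figure 5, $C_L(\eta,t_-(\eta)) = C_L(\eta,\infty)$ when $\eta = \alpha = (3 + 4\sqrt{2})/23 \approx 0.37639$. See Appendix B for proof of this fact. When $\eta < \alpha$, $C_L^*(\eta) = C_L(\eta,t_-(\eta))$, and when $\eta \ge \alpha$, $C_L^*(\eta) = C_L(\eta,\infty) = (1-\eta)/2$. Thus,

$$C_L^*(\eta) = \begin{cases} C_L(\eta,t_-(\eta)), & \eta < \alpha \\ \frac{1-\eta}{2}, & \eta \ge \alpha. \end{cases}$$

Next, consider $C^-_{L,\alpha}$. When $\eta < \alpha$, $C^-_{L,\alpha}(\eta)$ is either $C_L(\eta,0) = (1+\eta)/4$ or $C_L(\eta,\infty) = (1-\eta)/2$. Since $(1+\eta)/4 < (1-\eta)/2 \iff \eta < \frac{1}{3}$, we have $C^-_{L,\alpha}(\eta) = (1+\eta)/4$ for $0 \le \eta \le \frac{1}{3}$ and $C^-_{L,\alpha}(\eta) = (1-\eta)/2$ if $\frac{1}{3} < \eta < \alpha$. When $\eta > \alpha$, $C^-_{L,\alpha}(\eta) = C_L(\eta,t_-(\eta))$ when $\alpha < \eta < \frac{1}{2}$, and $C^-_{L,\alpha}(\eta) = C_L(\eta,0) = (1+\eta)/4$ for $\eta \ge \frac{1}{2}$. In summary,

$$C^-_{L,\alpha}(\eta) = \begin{cases} \frac{1+\eta}{4}, & 0 \le \eta \le \frac{1}{3} \text{ or } \eta \ge \frac{1}{2} \\ \frac{1-\eta}{2}, & \frac{1}{3} < \eta \le \alpha \\ C_L(\eta,t_-(\eta)), & \alpha < \eta < \frac{1}{2}. \end{cases}$$

Now $H_{L,\alpha}(\eta) = C^-_{L,\alpha}(\eta) - C_L^*(\eta)$. See Figure 6 for plots of these quantities. This is our first example where $H_{L,\alpha}$ is not convex.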
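The piecewise descriptions of $C_L^*$ and $C^-_{L,\alpha}$ for the uneven sigmoid loss with $\gamma = 2$ can be compared against a brute-force search. The sketch below is only an approximation under stated assumptions: the infimum over $t$ is replaced by a minimum over a finite grid on $[-40, 40]$ (so the limit $t \to \infty$ is approximated by $t = 40$), the restricted infimum is taken over $t \ge 0$ when $\eta < \alpha$ and over $t \le 0$ otherwise, and all function names are hypothetical.

```python
import math

ALPHA = (3.0 + 4.0 * math.sqrt(2.0)) / 23.0  # approximately 0.37639


def phi(t):
    """Sigmoid function phi(t) = 1 / (1 + e^t)."""
    return 1.0 / (1.0 + math.exp(t))


def sig_cond_risk(eta, t):
    """C_L(eta, t) for the uneven sigmoid loss with gamma = 2."""
    return eta * phi(t) + (1.0 - eta) / 2.0 * phi(-2.0 * t)


def grid_inf(eta, lo, hi, step=0.002):
    """Grid approximation of the infimum of C_L(eta, t) over lo <= t <= hi."""
    n = int(round((hi - lo) / step))
    return min(sig_cond_risk(eta, lo + i * step) for i in range(n + 1))


def opt_risk(eta):
    """Approximate C_L^*(eta): unrestricted infimum over t."""
    return grid_inf(eta, -40.0, 40.0)


def restricted_risk(eta):
    """Approximate C^-_{L,alpha}(eta): infimum over t with t * (eta - ALPHA) <= 0."""
    if eta < ALPHA:
        return grid_inf(eta, 0.0, 40.0)
    return grid_inf(eta, -40.0, 0.0)


def sig_H(eta):
    """Approximate H_{L,alpha}(eta) = C^-_{L,alpha}(eta) - C_L^*(eta)."""
    return restricted_risk(eta) - opt_risk(eta)
```

For instance, `restricted_risk(0.2)` should be close to $(1 + 0.2)/4$ and `opt_risk(0.45)` close to $(1 - 0.45)/2$, matching the case analysis above.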

Finally, the preceding discussion can be extended to arbitrary $\gamma > 0$. For every $\gamma > 0$ there is a unique $\alpha = \alpha(\gamma) \in (0,1)$ such that

$$L(y,t) = 1_{\{y=1\}}\frac{1}{1+e^{t}} + 1_{\{y=-1\}}\frac{1}{\gamma}\cdot\frac{1}{1+e^{-\gamma t}} \qquad (12)$$

is $\alpha$-CC. The relationship between $\alpha$ and $\gamma$ is shown in Figure 7. Calculation of this curve is discussed in Appendix B. In the appendix we show that $\alpha(1/\gamma) = 1 - \alpha(\gamma)$, which explains the sigmoidal shape of $\alpha$ as a function¹ of $\ln\gamma$.

Figure 5: Uneven sigmoid loss with $\gamma = 2$. $C_L(\eta,t)$ is graphed as a function of $t$ for six values of $\eta$. The circles indicate $(t_-(\eta), C_L(\eta,t_-(\eta)))$.

Figure 6: Uneven sigmoid loss with $\gamma = 2$. Plots of $H_{L,\alpha}$, $C^-_{L,\alpha}$, and $C_L^*$ for $\alpha = (3 + 4\sqrt{2})/23 \approx 0.37639$.

Now suppose $\alpha' \in (0,1)$ is the desired cost asymmetry. By Theorem 3, for $L$ in Eqn. (12), $L_{1-\alpha(\gamma)}$ is CC, and therefore $(L_{1-\alpha(\gamma)})_{\alpha'}$ is $\alpha'$-CC. This is a family of losses, indexed by $\gamma > 0$, all of which are $\alpha'$-CC.



5 Discussion 



The results of Bartlett et al. [2006] concerning surrogate regret bounds and classification calibration are generalized to label-dependent misclassification costs and arbitrary losses. Some differences that emerge in this more general framework are that $H_{L,\alpha}(\eta)$ is in general not symmetric about $\eta = \frac{1}{2}$, and $\nu_{L,\alpha}(\epsilon)$ is potentially discontinuous at $\epsilon = \min(\alpha, 1-\alpha)$. The framework of Steinwart [2007] is also applied. Although his notion of calibration is not always equivalent to the one adopted here, that approach based on calibration functions nonetheless leads to the same surrogate regret bounds.

¹ We investigated whether $\alpha(\gamma) = 1/(1 + e^{c\ln\gamma})$ for some $c > 0$, but evidently it does not.

Figure 7: Uneven sigmoid loss. Plot of the unique value of $\alpha = \alpha(\gamma)$ such that the uneven sigmoid loss with parameter $\gamma > 0$ (Eqn. (12)) is $\alpha$-CC.

The class of uneven margin losses is examined in some detail. We hope these results provide guidance to future work with such losses, as our theory explains how to ensure $\alpha$-classification calibration for any margin asymmetry parameter $\gamma > 0$. For example, AdaBoost is often applied to heavily unbalanced datasets where misclassification costs are label-dependent, such as in cascades for face detection [Viola and Jones, 2002]. It should be possible to generalize AdaBoost to have an uneven margin (to accommodate unbalanced data) while being $\alpha$-classification calibrated for any $\alpha \in (0,1)$. In particular, the uneven exponential loss from Sec. 4.3 can be optimized by the functional gradient descent approach. In fact, Masnadi-Shirazi and Vasconcelos [2007] developed such an algorithm for the special case $\gamma = \alpha/(1-\alpha)$, but did not identify the generalization to arbitrary $\gamma$.

Our theory also sheds light on the support vector machine with uneven margin. Yang et al. [2009] describe an implementation of this algorithm, but they allow for both $\beta$ and $\gamma$ to be free parameters. Our Corollary 2 constrains $\beta = 1/\gamma$ for classification calibration, which eliminates a tuning parameter.

In closing, we mention two additional directions for future work. First, an interesting problem related to uneven margin losses is that of surrogate tuning, which in this case is the problem of tuning the parameter $\gamma$ to a particular dataset. Nock and Nielsen [2009] have recently described a data-driven approach to surrogate tuning of classification-calibrated ($\alpha = \frac{1}{2}$) losses. Second, our regret bounds should be applicable to proving the cost-sensitive consistency of algorithms based on surrogate losses.



A Lemmas



LSC and USC abbreviate lower semi-continuous and upper semi-continuous. 
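Lemma 1. Let $L$ be a loss, $\alpha \in (0,1)$, and recall $B_\alpha = \max(\alpha, 1-\alpha)$.

1. (a) For any $\eta \in [0,1]$, $C_\alpha^*(\eta) = C_\alpha(\eta, \eta - \alpha)$. (b) For any $\eta \in [0,1]$, $t \in \mathbb{R}$, $C_\alpha(\eta,t) - C_\alpha^*(\eta) = 1_{\{\mathrm{sign}(t) \ne \mathrm{sign}(\eta-\alpha)\}}|\eta - \alpha|$. (c) $R_\alpha^* = R_\alpha(\eta - \alpha)$. (d) For any $f \in \mathcal{F}$,

$$R_\alpha(f) - R_\alpha^* = E_X\left[1_{\{\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X)-\alpha)\}}|\eta(X) - \alpha|\right].$$

2. (a) $C_L^*(\eta)$ is concave on $[0,1]$. (b) $C^-_{L,\alpha}(\eta)$ is concave on $[0,\alpha)$ and on $(\alpha,1]$.

3. (a) $C_L^*(\eta)$ is continuous on $[0,1]$. (b) $C^-_{L,\alpha}(\eta)$ and $H_{L,\alpha}(\eta)$ are continuous on $[0,1] \setminus \{\alpha\}$. (c) If $L$ is $\alpha$-CC, then $C^-_{L,\alpha}$ and $H_{L,\alpha}$ are continuous on $[0,1]$.

4. $H_{L,\alpha}(\alpha) = \nu_{L,\alpha}(0) = \mu_{L,\alpha}(0) = \psi_{L,\alpha}(0) = 0$.

5. $\nu_{L,\alpha}$ and $\mu_{L,\alpha}$ are LSC on $[0,B_\alpha]$. $\psi_{L,\alpha}$ is continuous on $[0,B_\alpha]$.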

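Proof. 1. For $\eta \in [0,1]$, $C_\alpha(\eta,t) = (1-\alpha)\eta 1_{\{t \le 0\}} + \alpha(1-\eta)1_{\{t > 0\}}$ is minimized by any $t$ such that $\mathrm{sign}(t) = \mathrm{sign}((1-\alpha)\eta - \alpha(1-\eta)) = \mathrm{sign}(\eta - \alpha)$. Therefore $C_\alpha(\eta, \eta - \alpha) = C_\alpha^*(\eta)$. This gives (a). It also implies

$$\begin{aligned} C_\alpha(\eta,t) - C_\alpha^*(\eta) &= (1-\alpha)\eta 1_{\{t \le 0\}} + \alpha(1-\eta)1_{\{t > 0\}} - \left[(1-\alpha)\eta 1_{\{\eta \le \alpha\}} + \alpha(1-\eta)1_{\{\eta > \alpha\}}\right] \\ &= 1_{\{\mathrm{sign}(t) \ne \mathrm{sign}(\eta-\alpha)\}}|\eta - \alpha|, \end{aligned}$$

which is (b). Part (c) now follows from (a) and $R_\alpha^* = E_X[C_\alpha^*(\eta(X))] = E_X[C_\alpha(\eta(X), \eta(X) - \alpha)] = R_\alpha(\eta - \alpha)$, while (d) follows from (b) and

$$\begin{aligned} R_\alpha(f) - R_\alpha^* &= E_X\left[C_\alpha(\eta(X), f(X)) - C_\alpha^*(\eta(X))\right] \\ &= E_X\left[1_{\{\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X)-\alpha)\}}|\eta(X) - \alpha|\right]. \end{aligned}$$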

2. Since $C_L^*(\eta) = \inf_{t \in \mathbb{R}} \eta L_1(t) + (1-\eta)L_{-1}(t)$, it is the infimum of affine functions and therefore concave. For $\eta < \alpha$, $C^-_{L,\alpha}(\eta) = \inf_{t \ge 0} C_L(\eta,t)$, which is also concave by the same reasoning. A similar argument applies when $\eta > \alpha$.

3. Since $C_L^*(\eta)$ is concave on $[0,1]$, it is continuous on $(0,1)$ by Theorem 10.1 of Rockafellar [1970]. By Theorem 10.2 of the same, $C_L^*$ is LSC at 0 and 1. Let us argue that $C_L^*$ is USC at 1, the case of 0 being similar. Thus, let $\epsilon > 0$ and let $t_\epsilon \in \mathbb{R}$ be such that $L_1(t_\epsilon) < C_L^*(1) + \frac{\epsilon}{2}$. If $L_{-1}(t_\epsilon) = 0$, then for any $\eta \in [0,1)$, $C_L^*(\eta) \le C_L(\eta,t_\epsilon) = \eta L_1(t_\epsilon) \le L_1(t_\epsilon) < C_L^*(1) + \epsilon$. Suppose $L_{-1}(t_\epsilon) > 0$. If $\eta$ is such that $1 - \frac{\epsilon}{2L_{-1}(t_\epsilon)} < \eta < 1$, then $C_L^*(\eta) \le \eta L_1(t_\epsilon) + (1-\eta)L_{-1}(t_\epsilon) < C_L^*(1) + \epsilon$. Thus $C_L^*$ is USC at 1. This establishes (a).

For (b), continuity of $C^-_{L,\alpha}$ on $[0,1] \setminus \{\alpha\}$ follows by a similar argument as (a). Continuity of $H_{L,\alpha}$ then follows immediately.

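It remains to show that $C^-_{L,\alpha}$, and hence $H_{L,\alpha}$, is continuous at $\alpha$ when $L$ is $\alpha$-CC. First note that $C^-_{L,\alpha}$ is LSC at $\alpha$ because $C^-_{L,\alpha}(\alpha) = C_L^*(\alpha)$, $C^-_{L,\alpha}(\eta) \ge C_L^*(\eta)$ for any $\eta \in [0,1]$, and from parts (a) and (b).

We now show $C^-_{L,\alpha}$ is USC at $\alpha$ when $L$ is $\alpha$-CC. Let $\epsilon > 0$. Since $C_L^*$ is continuous at $\alpha$, there exists $\delta' > 0$ such that $|C_L^*(\eta) - C_L^*(\alpha)| < \frac{\epsilon}{3}$ whenever $|\eta - \alpha| < \delta'$. Let $\delta_\alpha = \frac{1}{2}\min(\alpha, 1-\alpha)$, $M = \max(L_1(0), L_{-1}(0))$, and set $\delta = \min(\delta', \delta_\alpha, \frac{\epsilon}{3}\cdot\frac{\delta_\alpha}{2M})$. Now suppose $|\eta - \alpha| < \delta$, $\eta \ne \alpha$. Then

$$\begin{aligned} C^-_{L,\alpha}(\eta) - C^-_{L,\alpha}(\alpha) &= C^-_{L,\alpha}(\eta) - C_L^*(2\alpha - \eta) + C_L^*(2\alpha - \eta) - C_L^*(\alpha) \\ &< C^-_{L,\alpha}(\eta) - C_L^*(2\alpha - \eta) + \frac{\epsilon}{3}, \end{aligned}$$

since $|(2\alpha - \eta) - \alpha| = |\eta - \alpha| < \delta \le \delta'$. Since $L$ is $\alpha$-CC, there exists $t^*$, depending possibly on $\eta$ and $\epsilon$, such that $t^*((2\alpha - \eta) - \alpha) > 0$ and $C_L(2\alpha - \eta, t^*) < C_L^*(2\alpha - \eta) + \frac{\epsilon}{3}$. We may further stipulate $C_L(2\alpha - \eta, t^*) \le C_L(2\alpha - \eta, 0)$, which will be needed later. Notice $t^*((2\alpha - \eta) - \alpha) > 0 \iff t^*(\eta - \alpha) < 0$, which is also used later. Now $C^-_{L,\alpha}(\eta) - C_L^*(2\alpha - \eta) < C^-_{L,\alpha}(\eta) - C_L(2\alpha - \eta, t^*) + \frac{\epsilon}{3}$. Thus far we have shown $C^-_{L,\alpha}(\eta) - C^-_{L,\alpha}(\alpha) < C^-_{L,\alpha}(\eta) - C_L(2\alpha - \eta, t^*) + \frac{2\epsilon}{3}$ for $|\eta - \alpha| < \delta$, $\eta \ne \alpha$.

Now consider

$$\begin{aligned} C^-_{L,\alpha}(\eta) - C_L(2\alpha - \eta, t^*) &= \inf_{t \in \mathbb{R}:\, t(\eta - \alpha) \le 0} C_L(\eta,t) - C_L(2\alpha - \eta, t^*) \\ &\le C_L(\eta,t^*) - C_L(2\alpha - \eta, t^*) \\ &= \eta L_1(t^*) + (1-\eta)L_{-1}(t^*) - \left[(2\alpha - \eta)L_1(t^*) + (1 - (2\alpha - \eta))L_{-1}(t^*)\right] \\ &= 2\left[L_1(t^*)(\eta - \alpha) + L_{-1}(t^*)(\alpha - \eta)\right] \\ &\le 2\left[L_1(t^*) + L_{-1}(t^*)\right]|\eta - \alpha|. \end{aligned}$$

To bound this quantity, observe

$$\begin{aligned} M &= \max(L_1(0), L_{-1}(0)) \\ &\ge C_L(2\alpha - \eta, 0) \\ &\ge C_L(2\alpha - \eta, t^*) \\ &= (2\alpha - \eta)L_1(t^*) + (1 - (2\alpha - \eta))L_{-1}(t^*) \\ &\ge \frac{\alpha}{2}L_1(t^*) + \frac{1-\alpha}{2}L_{-1}(t^*) \\ &\ge \delta_\alpha\left(L_1(t^*) + L_{-1}(t^*)\right). \end{aligned}$$

To see the next to last inequality, recall $|\eta - \alpha| < \delta \le \delta_\alpha = \frac{1}{2}\min(\alpha, 1-\alpha)$. Then $2\alpha - \eta = \alpha + (\alpha - \eta) > \frac{\alpha}{2}$ and $1 - (2\alpha - \eta) = 1 - \alpha + (\eta - \alpha) > \frac{1-\alpha}{2}$. We now have $C^-_{L,\alpha}(\eta) - C_L(2\alpha - \eta, t^*) \le \frac{2M}{\delta_\alpha}|\eta - \alpha| < \frac{\epsilon}{3}$.

We have shown that for all $\epsilon > 0$, there exists $\delta > 0$ such that for all $\eta \in [0,1]$ with $|\eta - \alpha| < \delta$ and $\eta \ne \alpha$,

$$C^-_{L,\alpha}(\eta) - C^-_{L,\alpha}(\alpha) < \epsilon.$$

Therefore $C^-_{L,\alpha}$ is USC, and hence continuous, at $\alpha$.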

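4. $H_{L,\alpha}(\alpha) = 0$ because when $\eta = \alpha$, the infimum defining $C^-_{L,\alpha}(\alpha)$ is unrestricted. From this we have $\nu_{L,\alpha}(0) = H_{L,\alpha}(\alpha) = 0$. Since $0 \le \mu_{L,\alpha}(0) \le \nu_{L,\alpha}(0)$ we deduce $\mu_{L,\alpha}(0) = 0$. Finally, $\psi_{L,\alpha}(0) = 0$ because $\psi_{L,\alpha} = \nu_{L,\alpha}^{**}$, $\nu_{L,\alpha}(0) = 0$, and $\nu_{L,\alpha}$ is nonnegative.

5. From 3, $H_{L,\alpha}$ is continuous except possibly at $\alpha$. Therefore $\nu_{L,\alpha}$ is continuous except possibly at 0 and $b_\alpha := \min(\alpha, 1-\alpha)$. $\nu_{L,\alpha}$ is LSC at 0 because $\nu_{L,\alpha}(0) = 0$ and $\nu_{L,\alpha}$ is nonnegative. $\nu_{L,\alpha}$ is LSC at $b_\alpha$ because $\nu_{L,\alpha}(b_\alpha) = \nu_{L,\alpha}(b_\alpha^-) \le \nu_{L,\alpha}(b_\alpha^+)$, which follows from the definition of $\nu_{L,\alpha}$. Now lower semi-continuity of $\mu_{L,\alpha}$ follows from Lemma 2. □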
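Part 1(b) of Lemma 1 lends itself to a direct numerical check. The sketch below uses hypothetical function names and assumes the sign convention $\mathrm{sign}(t) = +1$ for $t > 0$ and $-1$ otherwise, which matches the indicators $1_{\{t \le 0\}}$ and $1_{\{t > 0\}}$ in the proof.

```python
def sign(t):
    """Sign convention assumed here: +1 for t > 0, otherwise -1."""
    return 1 if t > 0 else -1


def cs_cond_risk(eta, t, alpha):
    """C_alpha(eta, t) = (1 - alpha) * eta * 1{t <= 0} + alpha * (1 - eta) * 1{t > 0}."""
    return (1.0 - alpha) * eta if t <= 0 else alpha * (1.0 - eta)


def cs_opt_risk(eta, alpha):
    """C_alpha^*(eta), computed as C_alpha(eta, eta - alpha) per part 1(a)."""
    return cs_cond_risk(eta, eta - alpha, alpha)


def cs_excess(eta, t, alpha):
    """Excess conditional cost-sensitive risk C_alpha(eta, t) - C_alpha^*(eta)."""
    return cs_cond_risk(eta, t, alpha) - cs_opt_risk(eta, alpha)


def cs_excess_closed_form(eta, t, alpha):
    """Right-hand side of part 1(b): 1{sign(t) != sign(eta - alpha)} * |eta - alpha|."""
    return abs(eta - alpha) if sign(t) != sign(eta - alpha) else 0.0
```

For example, with `alpha = 0.3`, `eta = 0.8`, and `t = -1.0`, both `cs_excess` and `cs_excess_closed_form` evaluate to approximately 0.5.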



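The following result generalizes Lemma A.7 of Steinwart [2007].

Lemma 2. Let $\delta : [0,B] \to [0,\infty)$ be a lower semi-continuous function with $\delta(0) = 0$, and define $\check{\delta}(\epsilon) = \inf_{\epsilon' \ge \epsilon} \delta(\epsilon')$. Then $\check{\delta}$ is lower semi-continuous and $\check{\delta}^{**} = \delta^{**}$.

Proof. Suppose $\check{\delta}$ is not LSC at $\epsilon \in [0,B]$. Then there exist $\tau > 0$ and $\epsilon_1, \epsilon_2, \ldots \to \epsilon$ such that for $i$ sufficiently large, $\check{\delta}(\epsilon_i) \le \check{\delta}(\epsilon) - \tau$. Since $\check{\delta}$ is nondecreasing, we may assume $\epsilon_i < \epsilon$ for all $i$. If $\check{\delta}(\epsilon_i) \le \check{\delta}(\epsilon) - \tau$, then there exists $\epsilon_i' \in [\epsilon_i, \epsilon)$ such that $\delta(\epsilon_i') \le \check{\delta}(\epsilon) - \frac{\tau}{2} \le \delta(\epsilon) - \frac{\tau}{2}$. But $\epsilon_i' \to \epsilon$, which implies $\delta$ is not LSC at $\epsilon$, a contradiction.

To show $\check{\delta}^{**} = \delta^{**}$, we need to show $\overline{\mathrm{co}}\,\mathrm{Epi}\,\check{\delta} = \overline{\mathrm{co}}\,\mathrm{Epi}\,\delta$. It suffices to show $\mathrm{co}\,\mathrm{Epi}\,\check{\delta} = \mathrm{co}\,\mathrm{Epi}\,\delta$. Since $\check{\delta} \le \delta$, clearly $\mathrm{Epi}\,\delta \subseteq \mathrm{Epi}\,\check{\delta}$ and therefore $\mathrm{co}\,\mathrm{Epi}\,\delta \subseteq \mathrm{co}\,\mathrm{Epi}\,\check{\delta}$. For the reverse inclusion, it suffices to show $(\epsilon, \check{\delta}(\epsilon)) \in \mathrm{co}\,\mathrm{Epi}\,\delta$ for all $\epsilon \in [0,B]$. We may assume $\epsilon \in (0,B)$ since $\check{\delta}(0) = \delta(0) = 0$ and $\check{\delta}(B) = \delta(B)$. Thus let $\epsilon \in (0,B)$. Since $\delta$ is LSC, it achieves its infimum over a compact set, and hence there exists $\epsilon' \in [\epsilon, B]$ such that $\check{\delta}(\epsilon) = \delta(\epsilon')$. Since $(0,0), (\epsilon', \frac{\epsilon'}{\epsilon}\check{\delta}(\epsilon)) \in \mathrm{Epi}(\delta)$, it follows that

$$\frac{\epsilon}{\epsilon'}\left(\epsilon', \frac{\epsilon'}{\epsilon}\check{\delta}(\epsilon)\right) + \left(1 - \frac{\epsilon}{\epsilon'}\right)(0,0) = (\epsilon, \check{\delta}(\epsilon)) \in \mathrm{co}\,\mathrm{Epi}\,\delta,$$

as was to be shown. □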



B Uneven Sigmoid Loss Details 

We present a closed form expression for $t_-(\eta)$, and describe how to calculate $\alpha(\gamma)$ from Sec. 4.4.

$t_-(\eta)$ is the value of $t$ that satisfies $t < 0$ and

$$0 = \frac{\partial}{\partial t}C_L(\eta,t) = \eta\phi'(t) - (1-\eta)\phi'(-2t).$$

Using $\phi'(t) = -e^t/(1+e^t)^2$ and substituting $z = e^t$, $z$ must satisfy $z \in (0,1)$ and

$$\eta\frac{z}{(1+z)^2} = (1-\eta)\frac{z^{-2}}{(1+z^{-2})^2},$$

or equivalently, $z \in (0,1)$ is a solution of the quartic equation

$$\begin{aligned} 0 &= \eta z^4 - (1-\eta)z^3 + 2(2\eta - 1)z^2 - (1-\eta)z + \eta \\ &= z^2\left(\eta z^2 - (1-\eta)z + 2(2\eta - 1) - (1-\eta)z^{-1} + \eta z^{-2}\right). \end{aligned}$$

Note $z = 0$ is not the desired solution, as it corresponds to $t = -\infty$. Let $w = z + z^{-1}$, and observe $w^2 = z^2 + 2 + z^{-2}$. Then $z$ must satisfy

$$\begin{aligned} 0 &= \eta(z^2 + z^{-2}) - (1-\eta)(z + z^{-1}) + 2(2\eta - 1) \\ &= \eta(w^2 - 2) - (1-\eta)w + 2(2\eta - 1) \\ &= \eta w^2 - (1-\eta)w + 2(\eta - 1). \end{aligned}$$

Therefore

$$w = \frac{1 - \eta + \sqrt{(1-\eta)^2 - 8\eta(\eta - 1)}}{2\eta}.$$

We take the positive sign because only it gives a positive $z$. Now $z$ can be recovered from $w$. Since $z^2 - wz + 1 = 0$ we get

$$z = \frac{w - \sqrt{w^2 - 4}}{2}.$$

We take the negative sign as we are seeking the smaller of the two critical points. It can be shown (with algebra) that $w^2 > 4 \iff \eta < \frac{1}{2}$. Finally, we have $t_-(\eta) = \ln z$.
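The closed form for $t_-(\eta)$ translates directly into a few lines of code. The sketch below (hypothetical function names, $\gamma = 2$ only) computes $w$, then $z$, then $t_-(\eta) = \ln z$, and allows a check that the derivative of $C_L(\eta,\cdot)$ vanishes there. At $\eta = \alpha = (3+4\sqrt{2})/23$ the formulas give $w = 2\sqrt{2}$ and $z = \sqrt{2} - 1$, which is a convenient test value.

```python
import math


def phi(t):
    """Sigmoid function phi(t) = 1 / (1 + e^t)."""
    return 1.0 / (1.0 + math.exp(t))


def phi_prime(t):
    """Derivative phi'(t) = -e^t / (1 + e^t)^2."""
    e = math.exp(t)
    return -e / (1.0 + e) ** 2


def cond_risk(eta, t):
    """C_L(eta, t) for the uneven sigmoid loss with gamma = 2."""
    return eta * phi(t) + (1.0 - eta) / 2.0 * phi(-2.0 * t)


def cond_risk_derivative(eta, t):
    """d/dt C_L(eta, t) = eta * phi'(t) - (1 - eta) * phi'(-2t)."""
    return eta * phi_prime(t) - (1.0 - eta) * phi_prime(-2.0 * t)


def t_minus(eta):
    """Closed-form local minimizer t_-(eta), defined for 0 < eta < 1/2."""
    if not 0.0 < eta < 0.5:
        raise ValueError("t_minus is only defined for 0 < eta < 1/2")
    w = (1.0 - eta + math.sqrt((1.0 - eta) ** 2 - 8.0 * eta * (eta - 1.0))) / (2.0 * eta)
    z = (w - math.sqrt(w * w - 4.0)) / 2.0  # smaller root of z^2 - w*z + 1 = 0
    return math.log(z)
```

Evaluating `cond_risk(alpha, t_minus(alpha))` at the critical $\alpha$ should reproduce $(1-\alpha)/2$ up to rounding, which is the equality used in Sec. 4.4.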

We now turn to characterization of $\alpha(\gamma)$. Assume $\gamma > 1$. $\alpha(\gamma)$ is the value of $\eta$ such that

$$\frac{1-\eta}{\gamma} = C_L(\eta,\infty) = C_L(\eta,t) = \eta\frac{1}{1+e^{t}} + \frac{1-\eta}{\gamma}\cdot\frac{1}{1+e^{-\gamma t}}$$

is satisfied by a unique $t$ with $-\infty < t < 0$. Since $C_L(\eta,-\infty) = C_L(\eta,\infty) \iff \eta = \frac{1}{1+\gamma}$, we must have $\eta > \frac{1}{1+\gamma}$. After substituting $z = e^t$ and simplifying, we seek $\eta > \frac{1}{1+\gamma}$ such that

$$\eta\gamma z^\gamma - (1-\eta)z + (\eta\gamma - 1 + \eta) = 0$$

is satisfied for a unique $z \in (0,1)$. That is, we need the curves $p_\eta(z) := \eta\gamma z^\gamma$ and $q_\eta(z) := (1-\eta)z - (\eta\gamma - 1 + \eta)$ to intersect exactly once on $(0,1)$. Since $p_\eta$ is a strictly increasing convex function and $q_\eta$ is a line with positive slope, this can happen in one of three ways: (a) $p_\eta(0) > q_\eta(0)$ and $p_\eta(1) < q_\eta(1)$, (b) $p_\eta(0) < q_\eta(0)$ and $p_\eta(1) > q_\eta(1)$, or (c) $q_\eta$ is tangent to $p_\eta$ at some $z \in (0,1)$. (a) requires $\eta > 1/(1+\gamma)$ and $\eta < 1/(1+\gamma)$, which is impossible. Similarly, (b) is impossible. Thus, we must have $p_\eta'(z) = q_\eta'(z)$ for some $z \in (0,1)$.

Summarizing up to this point, we seek $\eta > \frac{1}{1+\gamma}$ and $z \in (0,1)$ such that

$$\eta\gamma z^\gamma = (1-\eta)z - (\eta\gamma - 1 + \eta) \qquad (13)$$

and

$$\eta\gamma^2 z^{\gamma-1} = 1 - \eta. \qquad (14)$$

Dividing (13) by (14) and solving for $z$ gives

$$z = \frac{\eta\gamma - 1 + \eta}{1-\eta}\cdot\frac{\gamma}{\gamma - 1}. \qquad (15)$$

Substituting (15) into (14) yields

$$\frac{\eta\gamma^2}{1-\eta}\left[\frac{\eta\gamma - 1 + \eta}{1-\eta}\cdot\frac{\gamma}{\gamma - 1}\right]^{\gamma-1} = 1. \qquad (16)$$

When $\gamma = 2$, this simplifies to a quadratic equation, leading to $\alpha(2) = (3 + 4\sqrt{2})/23$. More generally, notice that for $\eta > \frac{1}{1+\gamma}$, the left-hand side of (16) is strictly increasing, and thus $\eta = \alpha(\gamma)$ can be found with a bisection search. The case $\gamma = 1$ was treated by Bartlett et al. [2006], yielding $\alpha(1) = \frac{1}{2}$. When $\gamma < 1$ we may appeal to symmetry. Let us write $C_L^{(\gamma)}(\eta,t)$ to indicate the dependence of $C_L$ on $\gamma$. It is easily shown that $C_L^{(1/\gamma)}(\eta,\gamma t) = \gamma C_L^{(\gamma)}(1-\eta,-t)$, from which it follows that $\alpha(1/\gamma) = 1 - \alpha(\gamma)$.
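The bisection search for $\alpha(\gamma)$ can be sketched as follows. This is an illustrative implementation under the assumptions stated in the appendix: the left-hand side of (16) is increasing in $\eta$ on $(1/(1+\gamma), 1)$ for $\gamma > 1$, the case $\gamma = 1$ returns $1/2$, and $\gamma < 1$ is handled through the symmetry $\alpha(1/\gamma) = 1 - \alpha(\gamma)$. The function names and the tolerance are hypothetical choices.

```python
import math


def tangency_point(eta, gamma):
    """z from Eqn. (15); requires gamma > 1."""
    return (eta * gamma - 1.0 + eta) / (1.0 - eta) * gamma / (gamma - 1.0)


def lhs16(eta, gamma):
    """Left-hand side of Eqn. (16) for gamma > 1 and 1/(1+gamma) <= eta < 1."""
    z = max(tangency_point(eta, gamma), 0.0)  # guard against tiny negative rounding
    return eta * gamma ** 2 / (1.0 - eta) * z ** (gamma - 1.0)


def alpha_of_gamma(gamma, tol=1e-12):
    """Bisection for the eta solving lhs16(eta, gamma) = 1."""
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    if gamma == 1.0:
        return 0.5
    if gamma < 1.0:
        return 1.0 - alpha_of_gamma(1.0 / gamma, tol)  # symmetry alpha(1/g) = 1 - alpha(g)
    lo, hi = 1.0 / (1.0 + gamma), 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if lhs16(mid, gamma) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For $\gamma = 2$ the result can be compared with the closed form $(3 + 4\sqrt{2})/23 \approx 0.37639$, and `tangency_point` at that value should return approximately $\sqrt{2} - 1$.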
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